Error handling & resilience

Preventing a single broken service from bringing down the entire system.

The idea

In a microservices architecture, services call other services. If the "Payment Service" becomes slow or stops responding, the "Order Service" sits waiting on every call. If 1,000 users try to order at once, each waiting request ties up a thread and memory, so the Order Service exhausts its resources and crashes too. This is a Cascading Failure.

To prevent this, resilient systems use Timeouts (don't wait forever), bounded Retries (try again, but only 3 times), and a Circuit Breaker (if it fails 10 times in a row, stop trying for a minute so the Payment Service can recover, and return an immediate error to the user).
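
The bounded-retry idea above can be sketched in a few lines of Python. This is only an illustration: the helper name call_with_retries, its parameters, and the exponential backoff between attempts are assumptions, not part of the original text. "Only 3 times" is read here as three attempts in total.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func(); on failure, try again up to max_attempts times in total.

    Hypothetical helper for illustration. The delay doubles after each
    failed attempt so a struggling service is not hammered immediately.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            # Only wait if another attempt is still allowed.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Retries like this are only safe for operations that can be repeated without side effects. Retrying a payment charge, for example, risks charging twice unless the payment API deduplicates requests.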

How it works (Circuit Breaker Pattern)

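# circuit_breaker and payment_svc are assumed to be created elsewhere.
def place_order():
    try:
        # If the circuit is OPEN, this fails instantly without a network call.
        # If CLOSED, it attempts the call with a strict 2-second timeout.
        circuit_breaker.call(
            func=payment_svc.charge,
            timeout=2.0
        )
        return "Success"

    except CircuitBreakerOpenError:
        # The breaker rejected the call to give the Payment Service time to recover.
        return "Payments currently unavailable. Please try later."

    except TimeoutError:
        # The call was attempted but did not finish within 2 seconds.
        return "Payment timed out."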
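
For readers curious what might sit behind circuit_breaker.call, here is a minimal sketch of a breaker with the same call(func, timeout) shape. The class layout, the parameter names failure_threshold and recovery_seconds, and the use of a thread pool to enforce the timeout are all assumptions made for illustration. The defaults mirror the "10 failures, wait a minute" rule described earlier. It is not thread-safe and is not meant as a production implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is OPEN."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=10, recovery_seconds=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means the circuit is CLOSED
        self._pool = ThreadPoolExecutor(max_workers=4)

    def call(self, func, timeout):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_seconds:
                # Still cooling down: fail fast without calling func.
                raise CircuitBreakerOpenError("circuit is open")
            # Cool-down elapsed: let one trial call through (half-open).
        future = self._pool.submit(func)
        try:
            result = future.result(timeout=timeout)
        except FutureTimeoutError:
            self._record_failure()
            raise TimeoutError("call exceeded timeout") from None
        except Exception:
            self._record_failure()
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call, or too many consecutive failures, (re)opens it.
        if self.opened_at is not None or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

One caveat of this sketch: when the timeout fires, the caller stops waiting, but the worker thread keeps running func until it returns. Real clients usually also set a timeout on the network call itself so the underlying work is actually abandoned.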