Preventing a single broken service from bringing down the entire company.
In a microservices architecture, services call other services. If the "Payment Service" slows down or stops responding, the "Order Service" sits waiting on every call it makes. Each waiting request holds on to a thread, a connection, and memory, so if 1,000 users try to order at once, the Order Service runs out of resources and crashes as well. This is a Cascading Failure.
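A rough estimate shows why waiting is so costly. By Little's law, the average number of requests in progress equals the arrival rate multiplied by the time each request takes. The sketch below uses made-up numbers (100 orders per second, 0.25 s of normal latency, 30 s when the Payment Service hangs), and the function name is hypothetical.

```python
def in_flight(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average requests in progress = arrival rate * time per request."""
    return arrival_rate_per_s * latency_s


# Illustrative numbers only; they are assumptions, not measurements.
healthy = in_flight(100.0, 0.25)   # 25.0 requests held open at any moment
degraded = in_flight(100.0, 30.0)  # 3000.0 requests held open at any moment

print(f"healthy: {healthy:.0f} in flight, degraded: {degraded:.0f} in flight")
```

The traffic is the same in both cases. Only the latency changed, yet the Order Service now has to hold 120 times as many requests open, each one tying up resources.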
To prevent this, resilient systems use Timeouts (don't wait forever), bounded Retries (try again, but only a fixed number of times, such as 3), and a Circuit Breaker (if calls fail 10 times in a row, stop calling for a minute so the Payment Service can recover, and return an immediate error to the user in the meantime).
# Assumes circuit_breaker and payment_svc are created elsewhere.
def place_order():
    try:
        # If the circuit is OPEN, this fails instantly without a network call.
        # If CLOSED, it attempts the call with a strict 2-second timeout.
        circuit_breaker.call(
            func=payment_svc.charge,
            timeout=2.0
        )
        return "Success"
    except CircuitBreakerOpenError:
        return "Payments currently unavailable. Please try later."
    except TimeoutError:
        return "Payment timed out."
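The example above relies on a circuit_breaker object and a CircuitBreakerOpenError that it never defines. What follows is a minimal sketch of what they might look like, not a production implementation. The class layout, the parameter names (failure_threshold, recovery_seconds, clock), and the call_with_retries helper are all assumptions made for illustration. The doubling delay between retries is also an addition; the text only asks that retries be bounded. The sketch is not thread-safe, and on a timeout the caller merely stops waiting: the worker thread keeps running until the slow call returns.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class CircuitBreakerOpenError(Exception):
    """Raised when the circuit is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=10, recovery_seconds=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock          # injectable so tests can fake time
        self.failures = 0           # consecutive failures so far
        self.opened_at = None       # None means the circuit is CLOSED
        self._pool = ThreadPoolExecutor(max_workers=8)

    def call(self, func, timeout, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_seconds:
                # OPEN: fail fast, no network call is attempted.
                raise CircuitBreakerOpenError("circuit is open")
            # Cool-down elapsed: let a trial call through.
        future = self._pool.submit(func, *args, **kwargs)
        try:
            result = future.result(timeout=timeout)
        except FutureTimeout:
            self._record_failure()
            raise TimeoutError(f"no response within {timeout}s") from None
        except Exception:
            self._record_failure()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result

    def _record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open the circuit


def call_with_retries(breaker, func, timeout, attempts=3, base_delay=0.1,
                      sleep=time.sleep):
    """Bounded retries: try at most `attempts` times, then give up."""
    for attempt in range(attempts):
        try:
            return breaker.call(func, timeout)
        except CircuitBreakerOpenError:
            raise  # retrying against an open circuit is pointless
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # wait longer each time
```

With this sketch, the circuit_breaker used by place_order could be created once at startup as circuit_breaker = CircuitBreaker(). After the cool-down, the first call acts as a trial: if it succeeds the circuit closes, and if it fails the circuit opens again for another minute.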