Cascading failures

How a tiny blip in one service can accidentally DDoS your entire system via retries.

The idea

In distributed systems, failures are infectious. If Service C gets slightly slow, Services A and B start hitting timeouts. If a naive developer has configured A and B to "just retry on failure" (say, 3 attempts per request), A and B are suddenly sending up to 3x the normal traffic to C.
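
To put a rough number on that amplification, here is a small sketch. It assumes every attempt fails independently with the same probability, which is optimistic during a real outage, and the function name is made up for illustration.

```python
def expected_attempts(failure_rate: float, max_attempts: int) -> float:
    """Average number of calls one request sends downstream.

    Attempt k+1 only happens if the first k attempts all failed, so the
    expected count is 1 + p + p^2 + ... + p^(max_attempts - 1).
    """
    return sum(failure_rate ** k for k in range(max_attempts))


healthy = expected_attempts(0.01, 3)   # about 1.01: retries are almost free
degraded = expected_attempts(0.5, 3)   # 1.75: C sees 75% extra traffic
down = expected_attempts(1.0, 3)       # 3.0: the full 3x storm
```

The load grows exactly when C is least able to absorb it: the worse C gets, the closer every caller moves to the 3x worst case.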

Service C, already struggling, is crushed by the Retry Storm (a form of Thundering Herd) and goes down completely. Now A and B are stuck waiting on C, exhausting their own thread pools, and they crash too. This is a Cascading Failure. To stop it, you must use Load Shedding (reject excess work early), Circuit Breakers (stop calling a dependency that keeps failing), and Exponential Backoff with Jitter (space retries out and desynchronize them).
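
Load shedding is the easiest of those defenses to sketch: cap the number of requests in flight and reject the rest immediately instead of queueing them. The class and exception names below are hypothetical, and a real service would usually return an HTTP 503 or similar rather than raise.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is rejected to protect the service."""


class LoadShedder:
    """Rejects new work once `max_in_flight` requests are already running."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, handler, *args, **kwargs):
        # Non-blocking acquire: if no slot is free, fail fast rather than wait.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many requests in flight")
        try:
            return handler(*args, **kwargs)
        finally:
            self._slots.release()
```

Failing fast keeps threads from piling up behind a slow dependency, which is the thread pool exhaustion described above.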

[Diagram: API Gateway (Status: OK), Worker (Status: OK), Database (Status: OK). Healthy system: Gateway and Worker talk to Database.]

How it works (Breaking the Chain)

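# BAD: Naive Retry (Creates a Storm)
# (`db` and `Timeout` stand in for your database client and its timeout exception.)
def fetch_data():
    for _ in range(3):
        try:
            return db.query("...")  # Up to 3 back-to-back hits on an already slow DB
        except Timeout:
            continue  # Retries immediately: no backoff, no jitter
    raise Timeout("query failed after 3 attempts")  # Otherwise this silently returns None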
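
If retries are kept at all, they should wait between attempts. The sketch below shows exponential backoff with "full jitter": each retry sleeps a random time between zero and an exponentially growing ceiling. The `Timeout` class, the function name, and the default values are assumptions for illustration, not any particular client's API.

```python
import random
import time


class Timeout(Exception):
    """Stand-in for the timeout exception raised by the database client."""


def fetch_with_backoff(query, attempts=3, base=0.1, cap=5.0, sleep=time.sleep):
    """Call `query()` up to `attempts` times, sleeping between failures."""
    for attempt in range(attempts):
        try:
            return query()
        except Timeout:
            if attempt == attempts - 1:
                raise  # Out of retries: surface the failure to the caller.
            # Full jitter: random sleep in [0, min(cap, base * 2^attempt)].
            ceiling = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The randomness matters as much as the growth: without jitter, every caller that failed at the same moment retries at the same moment, and the spike simply repeats on a schedule.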

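# GOOD: Circuit Breaker (fails fast, which also sheds load from the DB)
# If DB fails 5 times, OPEN the circuit. Immediately return an error
# to callers for the next 30 seconds WITHOUT hitting the DB.
# This gives the DB time to recover.
# (`circuit_breaker` is an illustrative decorator, not a specific library's API;
# `timeout` is the number of seconds the circuit stays open.)
@circuit_breaker(failures=5, timeout=30)
def fetch_data():
    return db.query("...")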
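
The `circuit_breaker` decorator is not defined in the snippet above. One way it might be written is sketched below, under some assumptions: it counts consecutive failures, treats any exception as a failure, lets a single probe call through once the timeout has elapsed, and is not thread-safe. `CircuitOpenError` and the `clock` parameter are names introduced here for illustration.

```python
import functools
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the wrapped function while the circuit is open."""


def circuit_breaker(failures=5, timeout=30, clock=time.monotonic):
    def decorator(func):
        state = {"failures": 0, "opened_at": None}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if state["opened_at"] is not None:
                if clock() - state["opened_at"] < timeout:
                    # OPEN: fail fast without touching the dependency.
                    raise CircuitOpenError(func.__name__)
                # HALF-OPEN: timeout elapsed, let this call through as a probe.
            try:
                result = func(*args, **kwargs)
            except Exception:
                state["failures"] += 1
                if state["opened_at"] is not None or state["failures"] >= failures:
                    state["opened_at"] = clock()  # Trip (or re-trip) the breaker.
                raise
            # Success closes the circuit and resets the failure count.
            state["failures"] = 0
            state["opened_at"] = None
            return result

        return wrapper

    return decorator
```

Callers need to handle `CircuitOpenError` explicitly, for example by serving cached data or a degraded response, so the fast failure does not become the next link in the cascade.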