Cascading failures

How a tiny blip in one service can turn into an accidental DDoS of your entire system via retries.

The idea

In distributed systems, failures are infectious. If Service C gets slightly slow, Services A and B start hitting timeouts. Suppose a naive developer has configured A and B to "just retry on failure," up to three attempts per request. Suddenly, A and B are sending up to 3x the normal traffic to C.

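Service C, already struggling, is crushed by the Retry Storm (a close relative of the Thundering Herd problem) and goes down completely. Now A and B are stuck waiting on C, their threads blocked until their own thread pools are exhausted, and they stop serving requests too. This is a Cascading Failure. To stop it, combine three defenses: Load Shedding (C rejects excess work early instead of queueing it), Circuit Breakers (A and B stop calling C while it is failing), and Exponential Backoff with Jitter (retries are spaced out and desynchronized).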
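Of the three defenses, load shedding is the one Service C applies to itself. Below is a minimal sketch, assuming C can cap the number of requests it handles at once. The names `LoadShedder`, `Overloaded`, and `max_in_flight` are hypothetical, chosen for illustration. When all slots are busy, the request is rejected immediately rather than queued.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is rejected because the service is at capacity."""


class LoadShedder:
    """Hypothetical helper: caps concurrent work and rejects the excess."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, handler, *args, **kwargs):
        # Non-blocking acquire: when every slot is taken, reject right away
        # instead of letting requests pile up in a queue.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("at capacity, try again later")
        try:
            return handler(*args, **kwargs)
        finally:
            self._slots.release()
```

A cheap, fast rejection costs C far less than a slow request that times out anyway, and it gives callers a clear signal to back off.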

Diagram: a healthy system, in which Gateway and Worker both talk to Database.

How it works (Breaking the Chain)

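# BAD: Naive Retry (Creates a Storm)
# `db` and `Timeout` stand in for your database client and its timeout exception.
def fetch_data():
    for _ in range(3):
        try:
            return db.query("...")  # Up to 3 back-to-back attempts on timeout
        except Timeout:
            continue  # Retries immediately: no delay, no jitter
    # All attempts failed: surface the error instead of silently returning None
    raise Timeout("db.query timed out 3 times")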
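A gentler retry waits before trying again. The sketch below shows one way to add Exponential Backoff with Jitter to the loop above. It is an illustration under stated assumptions, not a definitive implementation: `retry_with_backoff` and its parameters are hypothetical names, and the built-in `TimeoutError` stands in for whatever timeout exception your client raises. It uses the "full jitter" variant, where each wait is a random value between zero and an exponentially growing cap.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=3, base_delay=0.1, max_delay=5.0,
                       retry_on=(TimeoutError,), sleep=time.sleep):
    """Call operation(), retrying on `retry_on` errors with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            # Exponential cap: base_delay, then 2x, 4x, ... limited by max_delay
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: a random wait in [0, cap] so that clients which
            # failed together do not all retry at the same instant
            sleep(random.uniform(0, cap))
```

Usage would look like `retry_with_backoff(lambda: db.query("..."))`. The jitter matters as much as the backoff: without it, every caller that timed out together retries together.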

# GOOD: Circuit Breaker + Load Shedding
# `circuit_breaker` is an illustrative decorator, not a specific library's API.
# If the DB fails 5 times, OPEN the circuit. For the next 30 seconds,
# immediately return an error to callers WITHOUT hitting the DB.
# Failing fast sheds load from the DB and gives it time to recover.
@circuit_breaker(failures=5, timeout=30)
def fetch_data():
    return db.query("...")
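
The `@circuit_breaker(failures=5, timeout=30)` decorator above is not defined in the snippet. One possible shape for it is sketched below, as an assumption rather than any particular library's implementation. `CircuitOpenError` and the `clock` parameter are hypothetical additions. This version counts consecutive failures, lets a single trial call through once the cooldown has elapsed, and is not thread-safe.

```python
import functools
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected without running."""


def circuit_breaker(failures=5, timeout=30, clock=time.monotonic):
    """Open after `failures` consecutive errors; reject calls for `timeout` seconds."""
    def decorator(func):
        state = {"count": 0, "opened_at": None}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if state["opened_at"] is not None:
                if clock() - state["opened_at"] < timeout:
                    # OPEN: fail fast without touching the downstream service
                    raise CircuitOpenError(f"{func.__name__}: circuit open")
                # Cooldown elapsed: let this call through as a trial
            try:
                result = func(*args, **kwargs)
            except Exception:
                state["count"] += 1
                if state["opened_at"] is not None or state["count"] >= failures:
                    # Open the circuit, or re-open it after a failed trial
                    state["opened_at"] = clock()
                raise
            # Success: close the circuit and reset the failure count
            state["count"] = 0
            state["opened_at"] = None
            return result
        return wrapper
    return decorator
```

Callers should treat `CircuitOpenError` as a signal to fall back (cached data, a default, or a quick error to the user) rather than as a reason to retry.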