How a tiny blip in one service can accidentally DDoS your entire system via retries.
In distributed systems, failures are infectious. If Service C gets slightly slow, Services A and B start hitting timeouts. A naive developer configures A and B to "just retry on failure." Suddenly, with three attempts per request, A and B are sending up to 3x the normal traffic to C.
Service C, already struggling, is crushed by the Retry Storm (a close cousin of the Thundering Herd) and goes down completely. Now A and B are stuck waiting on C, exhausting their own thread pools, and they crash too. This is a Cascading Failure. To stop it, combine three defenses: Load Shedding (reject excess work early), Circuit Breakers (stop calling a dependency that is failing), and Exponential Backoff with Jitter (space retries out and randomize them).
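Load shedding is the piece that protects Service C itself: rather than queueing every request until it falls over, it rejects the excess quickly. A minimal sketch, assuming a simple cap on in-flight requests; the `LoadShedder` class and `Overloaded` exception are hypothetical names for illustration, not part of any real library:

```python
import threading


class Overloaded(Exception):
    """Raised when a request is rejected to protect the service (hypothetical name)."""


class LoadShedder:
    """Caps concurrent requests; anything over the cap is rejected immediately."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def run(self, handler, *args, **kwargs):
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                # Fail fast instead of queueing: a quick rejection is cheaper
                # than a slow timeout for both the caller and the service.
                raise Overloaded("too many requests in flight")
            self._in_flight += 1
        try:
            return handler(*args, **kwargs)
        finally:
            # Always release the slot, even if the handler raised.
            with self._lock:
                self._in_flight -= 1
```

The cap would normally be set a little below the concurrency at which the service starts to degrade, so that rejected callers get a fast error while accepted ones still get a timely answer.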
# BAD: Naive Retry (Creates a Storm)
def fetch_data():
    for _ in range(3):
        try:
            return db.query("...")  # Retried immediately, up to 3 times, with no delay
        except Timeout:
            continue
    # Falls through and silently returns None if all 3 attempts time out
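A gentler retry spaces its attempts out and randomizes them, so that callers do not all come back at the same instant. A minimal sketch of exponential backoff with "full jitter"; the function name `retry_with_backoff`, its parameters, and the use of the built-in `TimeoutError` are assumptions made for illustration, not the API of any particular library:

```python
import random
import time


def retry_with_backoff(operation, max_attempts=3, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call operation(); on TimeoutError, wait a random, growing delay and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the failure instead of hiding it.
            # Exponential ceiling: base_delay, 2x, 4x, ... capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random delay in [0, cap] desynchronizes the callers.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists only so the delay can be swapped out in tests. Backoff reduces the pressure from each caller, but it does not stop callers from retrying against a dependency that is clearly down.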
# GOOD: Circuit Breaker + Load Shedding
# If the DB fails 5 times, OPEN the circuit. For the next 30 seconds,
# return an error to callers immediately WITHOUT hitting the DB.
# This gives the DB time to recover.
@circuit_breaker(failures=5, timeout=30)
def fetch_data():
    return db.query("...")
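`@circuit_breaker` above is not a standard-library decorator. A minimal sketch of how one might be written, assuming the same `failures` and `timeout` parameters, a hypothetical `CircuitOpenError`, and a policy of counting consecutive failures of any exception type. Real libraries differ in these details, and this version is not thread-safe:

```python
import functools
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open (hypothetical name)."""


def circuit_breaker(failures=5, timeout=30, clock=time.monotonic):
    """Open after `failures` consecutive errors; reject calls for `timeout` seconds."""
    def decorator(func):
        state = {"failures": 0, "opened_at": None}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if state["opened_at"] is not None:
                if clock() - state["opened_at"] < timeout:
                    # Open: fail fast without touching the backend.
                    raise CircuitOpenError(f"circuit open for {func.__name__}")
                # Timeout elapsed: half-open, let this one call through as a probe.
            try:
                result = func(*args, **kwargs)
            except Exception:
                state["failures"] += 1
                if state["failures"] >= failures:
                    state["opened_at"] = clock()  # Open (or re-open) the circuit.
                raise
            # Success closes the circuit and resets the failure count.
            state["failures"] = 0
            state["opened_at"] = None
            return result

        return wrapper
    return decorator
```

After the 30 seconds pass, a single probe call is allowed through: if it succeeds the circuit closes, and if it fails the circuit re-opens for another full timeout.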