When everyone reaches for the fire exit at once, the door becomes the fire — pace the crowd or nobody gets out.
Something goes wrong — a bad deploy, a region blip — and suddenly thousands of jobs all decide to restore from backup at the same instant. Each restore is a heavy read from the storage backend. All of them arriving together saturate the backend, so every restore slows down, times out, and retries, which adds even more load. That feedback loop is a thundering herd turning into a storm.
You don't fix this by going faster — the backend is already maxed. You fix it by triaging (find and stop the source of the surge) and containing it: cap concurrency, queue the rest, add jitter so retries don't re-synchronise, and let a single fetch satisfy many waiters. The goal is to convert a spike into a steady, survivable trickle.
The storm phase: everyone fires at once, the backend overloads, timeouts trigger synchronised retries, load stays pinned. The contain phase: a semaphore caps in-flight restores, excess requests queue, retries get random jitter so they spread out, and a single-flight cache means duplicate restores of the same backup share one fetch.
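
    # Contain the herd: bounded concurrency + jitter + single-flight
    import random
    import threading
    from concurrent.futures import Future

    MAX_INFLIGHT = 50                        # illustrative cap on concurrent restores
    CAP = 60                                 # illustrative backoff ceiling, in seconds
    sem = threading.Semaphore(MAX_INFLIGHT)  # cap concurrent restores
    inflight = {}                            # backup_id -> shared Future
    lock = threading.Lock()                  # guards the inflight map

    def restore(backup_id):
        mine = Future()
        with lock:                           # single-flight: dedup identical work
            fut = inflight.setdefault(backup_id, mine)
        if fut is mine:                      # first caller fetches; registered before queuing
            try:
                with sem:                    # queue beyond the cap, don't pile on
                    fut.set_result(run_restore(backup_id))  # run_restore: your blocking backend read
            except BaseException as exc:
                fut.set_exception(exc)       # waiters see the same failure
            finally:
                with lock:
                    inflight.pop(backup_id, None)
        return fut.result()                  # duplicates block here and share the result

    def retry_after(attempt):                # de-synchronise retries
        base = min(CAP, 2 ** attempt)
        return random.uniform(0, base)       # full jitter: anywhere in [0, base], never past CAP
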
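
To see where a jittered delay fits, here is a minimal sketch of a retry loop around a restore call. The names `backoff_delay` and `call_with_retries`, the `base` and `cap` values, and the five-attempt budget are all assumptions for illustration, not part of any particular client library. The attempt budget is what keeps retries from amplifying the storm indefinitely.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full jitter: pick a delay uniformly in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(fn, max_attempts=5, sleep=time.sleep):
    """Call fn, retrying on TimeoutError with jittered backoff (hypothetical helper)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # budget spent: surface the failure instead of adding load
            sleep(backoff_delay(attempt))  # random wait so clients drift apart
```

Passing `sleep` as a parameter is only a convenience for testing; in production you would leave the default.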
Triage first, though: if a runaway client or bad health check is causing the restores, stopping it removes the load at the source — far better than absorbing it.
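
One way to find the source is to group recent restore requests by whoever issued them. The sketch below assumes you can pull `(source, backup_id)` pairs out of your request log; that event shape and the function name `top_restore_sources` are hypothetical.

```python
from collections import Counter


def top_restore_sources(events, n=3):
    """Return the n busiest sources as (source, count, share_of_total).

    events: iterable of (source, backup_id) pairs (assumed log shape).
    """
    counts = Counter(source for source, _ in events)
    total = sum(counts.values())
    return [(source, count, count / total) for source, count in counts.most_common(n)]
```

If one source owns most of the traffic, stopping it is probably the cheapest fix available.
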
| Signal | Reading |
|---|---|
| Restore rate spikes vertically | Synchronised herd, not organic growth |
| Backend latency & errors climb together | Saturation, then timeout-driven retries |
| Retry count > original request count | Retry amplification feeding the storm |
| Many restores of the same backup id | Single-flight would collapse them |
| Load drops the instant you cap concurrency | Containment is working |
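
Two of these signals are easy to compute from a request log. This is a rough sketch, assuming each request is a dict with hypothetical keys `backup_id` and `attempt` (0 for the original request, 1 and up for retries). The duplicate share is an upper bound on what single-flight could collapse, since it ignores whether the duplicates actually overlap in time.

```python
def storm_signals(requests):
    """Summarise retry amplification and duplicate work in a batch of restore requests."""
    requests = list(requests)
    originals = sum(1 for r in requests if r["attempt"] == 0)
    retries = len(requests) - originals
    distinct = len({r["backup_id"] for r in requests})

    if originals:
        retry_ratio = retries / originals  # above 1.0 means retries outnumber originals
    else:
        retry_ratio = float("inf") if retries else 0.0

    # Fraction of requests that repeat a backup id already seen in the batch.
    duplicate_share = 1 - distinct / len(requests) if requests else 0.0
    return {"retry_ratio": retry_ratio, "duplicate_share": duplicate_share}
```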
A bad config push makes 3,000 workers crash-loop; each restart triggers a restore. All 3,000 hit the backend within seconds, latency jumps from 50 ms to 30 s, restores time out and retry, and load pins at 100%. Triage: you spot the crash-loop source and roll back the config — new restores stop. Contain: a semaphore of 50 drains the backlog in waves, full-jitter backoff spreads the retries, and single-flight collapses the many identical restores of the same base image into one. Latency falls back under a second and the queue empties cleanly.
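
A back-of-envelope estimate shows why the drain is quick once the cap is in place. The function below is a sketch under stated assumptions: every restore takes the same time, waves run back to back, and `distinct_backups` (when given) is how many unique backups the requests actually target. The one-second restore time and the figure of 200 distinct backups used afterwards are illustrative guesses, not numbers from the incident.

```python
import math


def drain_estimate(requests, max_inflight, seconds_per_restore, distinct_backups=None):
    """Return (fetches, waves, seconds) needed to drain a backlog under a concurrency cap."""
    # Single-flight means only one fetch per distinct backup reaches the backend.
    fetches = requests if distinct_backups is None else min(requests, distinct_backups)
    waves = math.ceil(fetches / max_inflight)
    return fetches, waves, waves * seconds_per_restore
```

With 3,000 requests and a cap of 50, `drain_estimate(3000, 50, 1.0)` gives 60 waves, or about a minute at one second per restore. If single-flight collapses those requests to 200 distinct backups, `drain_estimate(3000, 50, 1.0, distinct_backups=200)` drops to 4 waves.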
The backend is saturated by a restore storm. Which move helps most right now?