Thundering restore storm

When everyone reaches for the fire exit at once, the door becomes the fire — pace the crowd or nobody gets out.

The idea

Something goes wrong — a bad deploy, a region blip — and suddenly thousands of jobs all decide to restore from backup at the same instant. Each restore is a heavy read from the storage backend. All of them arriving together saturate the backend, so every restore slows down, times out, and retries, which adds even more load. That feedback loop is a thundering herd turning into a storm.

You don't fix this by going faster, because the backend is already maxed out. You fix it in two steps: triage (find and stop the source of the surge), then contain (cap concurrency, queue the rest, add jitter so retries don't re-synchronise, and let a single fetch satisfy many waiters). The goal is to convert a spike into a steady, survivable trickle.

How it works

The storm phase: everyone fires at once, the backend overloads, timeouts trigger synchronised retries, load stays pinned. The contain phase: a semaphore caps in-flight restores, excess requests queue, retries get random jitter so they spread out, and a single-flight cache means duplicate restores of the same backup share one fetch.

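# Contain the herd: bounded concurrency + jitter + single-flight
import random
import threading
from concurrent.futures import Future

MAX_INFLIGHT = 50                    # cap on concurrent restores (tune to the backend)
CAP = 60                             # longest backoff in seconds (illustrative value)
sem = threading.Semaphore(MAX_INFLIGHT)
inflight = {}                        # backup_id -> shared future
inflight_lock = threading.Lock()     # guards the inflight map against races

def restore(backup_id):              # run_restore is your blocking restore call
    with inflight_lock:              # single-flight: dedup identical work
        fut = inflight.get(backup_id)
        leader = fut is None
        if leader:
            fut = inflight[backup_id] = Future()
    if not leader:
        return fut.result()          # wait for the leader's fetch instead of starting another
    try:
        with sem:                    # queue beyond the cap, don't pile on
            result = run_restore(backup_id)
        fut.set_result(result)
        return result
    except BaseException as exc:
        fut.set_exception(exc)       # waiters see the same failure
        raise
    finally:
        with inflight_lock:
            inflight.pop(backup_id, None)

def retry_after(attempt):            # de-synchronise retries
    base = min(CAP, 2 ** attempt)
    return random.uniform(0, base)   # full jitter: anywhere in [0, base], not a fixed backoff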
Triage first, though: if a runaway client or bad health check is causing the restores, stopping it removes the load at the source — far better than absorbing it.
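
One quick way to look for a runaway source is to group recent restore requests by caller and see whether a handful dominate. This is a minimal sketch, assuming each request is logged as a (caller_id, backup_id) pair; the record shape, the function name, and the sample data are all made up for illustration.

```python
from collections import Counter

def top_restore_callers(requests, n=3):
    """Return the n busiest callers as (caller, count, share_of_total).

    `requests` is an iterable of (caller_id, backup_id) pairs (assumed log shape).
    """
    counts = Counter(caller for caller, _ in requests)
    total = sum(counts.values())
    return [(caller, count, count / total) for caller, count in counts.most_common(n)]

# Illustrative data: one crash-looping pool dominates the surge.
sample = [("pool-a", "base-img")] * 90 + [("pool-b", "db-1")] * 7 + [("pool-c", "db-2")] * 3
for caller, count, share in top_restore_callers(sample):
    print(f"{caller}: {count} restores ({share:.0%})")
```

If one caller owns most of the traffic, stopping it is usually the cheapest fix; if the load is evenly spread, containment has to carry more of the weight.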

Signals

Signal	Reading
Restore rate spikes vertically	Synchronised herd, not organic growth
Backend latency & errors climb together	Saturation, then timeout-driven retries
Retry count > original request count	Retry amplification feeding the storm
Many restores of the same backup id	Single-flight would collapse them
Load drops the instant you cap concurrency	Containment is working
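
Two of these signals reduce to simple ratios over a window of request events. The sketch below assumes each event is a (backup_id, attempt) pair with attempt 0 meaning the original request; that shape and the name storm_signals are assumptions for illustration, not any monitoring tool's format.

```python
from collections import Counter

def storm_signals(events):
    """Summarise a window of restore events given as (backup_id, attempt) pairs."""
    originals = sum(1 for _, attempt in events if attempt == 0)
    retries = len(events) - originals
    per_backup = Counter(backup_id for backup_id, _ in events)
    if originals:
        retry_ratio = retries / originals
    else:
        retry_ratio = float("inf") if retries else 0.0
    return {
        # above 1.0 means retries outnumber original requests
        "retry_ratio": retry_ratio,
        # upper bound on the share single-flight could collapse,
        # assuming the duplicate requests overlap in time
        "duplicate_share": 1 - len(per_backup) / len(events) if events else 0.0,
    }

window = [("base-img", 0), ("base-img", 1), ("base-img", 2), ("db-1", 0)]
print(storm_signals(window))
```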

Watch out for

Adding capacity mid-storm. New nodes get hammered too and you've just made a bigger fire; cap concurrency first.
Fixed backoff. If everyone retries after exactly 1s, the herd just re-synchronises one second later. Use full jitter.
No single-flight. A thousand jobs restoring the same backup do a thousand identical reads; collapse them into one.
Skipping triage. If a misbehaving caller is generating the restores, containment treats the symptom while the source keeps firing.
Unbounded retries. Cap attempts and shed load past a point — a failed restore now is better than a downed backend for everyone.
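
The last two points, jittered backoff and a hard cap on attempts, can be combined in one retry loop. This is a sketch under assumptions: do_restore is a hypothetical zero-argument callable that raises TimeoutError when a restore times out, RestoreShed is an invented exception name, and the default limits are placeholders rather than recommendations.

```python
import random
import time

class RestoreShed(Exception):
    """Raised when a restore is abandoned to protect the backend."""

def restore_with_retries(do_restore, max_attempts=5, cap=60.0, sleep=time.sleep):
    """Call do_restore() at most max_attempts times with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return do_restore()
        except TimeoutError:
            if attempt == max_attempts - 1:
                break                                   # out of attempts: shed the load
            # full jitter: wait a random time in [0, min(cap, 2**attempt)] seconds
            sleep(random.uniform(0, min(cap, 2 ** attempt)))
    raise RestoreShed(f"gave up after {max_attempts} attempts")

# Usage with a fake restore that times out twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError
    return "restored"

print(restore_with_retries(flaky, sleep=lambda seconds: None))  # skip real sleeping in the demo
```

Passing sleep as a parameter is only a convenience so the loop can be exercised without waiting; in production the default time.sleep applies.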

Worked example

A bad config push makes 3,000 workers crash-loop; each restart triggers a restore. All 3,000 hit the backend within seconds, latency jumps from 50 ms to 30 s, restores time out and retry, and load pins at 100%. Triage: you spot the crash-loop source and roll back the config — new restores stop. Contain: a semaphore of 50 drains the backlog in waves, full-jitter backoff spreads the retries, and single-flight collapses the many identical restores of the same base image into one. Latency falls back under a second and the queue empties cleanly.
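
The "waves" in this example are easy to estimate. A back-of-the-envelope sketch, assuming every restore takes about the same time once the backend is healthy; the one-second duration and the count of 200 distinct backups are invented inputs, since the example gives neither.

```python
import math

def drain_estimate(pending, max_inflight, seconds_per_restore, distinct_backups=None):
    """Rough number of waves and seconds needed to drain a restore backlog.

    If distinct_backups is given, single-flight is assumed to collapse the
    backlog to one fetch per distinct backup.
    """
    fetches = pending if distinct_backups is None else min(pending, distinct_backups)
    waves = math.ceil(fetches / max_inflight)
    return waves, waves * seconds_per_restore

print(drain_estimate(3000, 50, 1.0))                        # (60, 60.0) without dedup
print(drain_estimate(3000, 50, 1.0, distinct_backups=200))  # (4, 4.0) with dedup
```

Even without deduplication, 3,000 restores at 50 in flight is only 60 waves, which is why a modest cap drains the queue quickly once retries stop amplifying the load.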

Check yourself

The backend is saturated by a restore storm. Which move helps most right now?