Thundering restore storm

When everyone reaches for the fire exit at once, the door becomes the fire — pace the crowd or nobody gets out.

The idea

Something goes wrong — a bad deploy, a region blip — and suddenly thousands of jobs all decide to restore from backup at the same instant. Each restore is a heavy read from the storage backend. All of them arriving together saturate the backend, so every restore slows down, times out, and retries, which adds even more load. That feedback loop is a thundering herd turning into a storm.
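To make the feedback loop concrete, here is a deliberately crude toy model, not a description of any real backend. It assumes every unfinished job re-sends one attempt per tick, and that once offered load exceeds capacity, useful throughput shrinks in proportion to the overload. The function name `simulate`, the degradation rule, and the numbers are all illustrative assumptions.

```python
def simulate(jobs, capacity, ticks, gated):
    """Toy model: return (jobs still pending, total attempts sent) after `ticks`."""
    pending = float(jobs)
    attempts_sent = 0.0
    for _ in range(ticks):
        if pending <= 0:
            break
        # With a gate, only `capacity` attempts reach the backend per tick.
        offered = min(pending, capacity) if gated else pending
        attempts_sent += offered
        if offered <= capacity:
            done = offered                            # everything admitted completes
        else:
            done = capacity * (capacity / offered)    # assumed goodput collapse
        pending -= done
    return pending, attempts_sent

left_gated, sent_gated = simulate(3000, 50, 60, gated=True)
left_storm, sent_storm = simulate(3000, 50, 60, gated=False)
print(left_gated, sent_gated)    # 0.0 3000.0
print(left_storm > 2900, sent_storm > 10 * sent_gated)  # True True
```

Under these assumptions the gated run finishes all 3,000 jobs in 60 ticks with exactly one attempt each, while the ungated run has barely made progress after sending far more attempts. Real backends degrade differently, but the shape of the loop is the same.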

You don't fix this by going faster — the backend is already maxed. You fix it by triaging (find and stop the source of the surge) and containing it: cap concurrency, queue the rest, add jitter so retries don't re-synchronise, and let a single fetch satisfy many waiters. The goal is to convert a spike into a steady, survivable trickle.


How it works

The storm phase: everyone fires at once, the backend overloads, timeouts trigger synchronised retries, load stays pinned. The contain phase: a semaphore caps in-flight restores, excess requests queue, retries get random jitter so they spread out, and a single-flight cache means duplicate restores of the same backup share one fetch.

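# Contain the herd: bounded concurrency + jitter + single-flight
import random
import threading
from concurrent.futures import Future

MAX_INFLIGHT = 50                    # example cap on concurrent restores; tune to the backend
CAP = 60                             # example ceiling on backoff, in seconds
sem = threading.Semaphore(MAX_INFLIGHT)
inflight = {}                        # backup_id -> shared Future
inflight_lock = threading.Lock()     # makes check-and-register atomic

def restore(backup_id):
    with inflight_lock:
        fut = inflight.get(backup_id)
        leader = fut is None
        if leader:                   # first caller registers the shared future
            fut = inflight[backup_id] = Future()
    if not leader:                   # single-flight: wait on the leader's fetch
        return fut.result()
    try:
        with sem:                    # queue beyond the cap, don't pile on
            fut.set_result(run_restore(backup_id))  # run_restore: the blocking fetch, defined elsewhere
    except BaseException as exc:
        fut.set_exception(exc)       # waiters see the same failure as the leader
    finally:
        with inflight_lock:
            inflight.pop(backup_id, None)
    return fut.result()

def retry_after(attempt):            # de-synchronise retries
    base = min(CAP, 2 ** attempt)
    return random.uniform(0, base)   # full jitter: anywhere in [0, base], not a fixed backoff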
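To see why jitter matters, compare how many retries land in the same second under a fixed backoff versus full jitter. This is a small sketch under assumed numbers (3,000 clients, all on their fifth retry, a 60-second ceiling); the helper names `fixed_backoff`, `full_jitter`, and `peak_per_second` are made up for the illustration.

```python
import random
from collections import Counter

def peak_per_second(delays):
    """Largest number of retries landing in the same one-second bucket."""
    buckets = Counter(int(d) for d in delays)
    return max(buckets.values())

def fixed_backoff(attempt, cap=60):
    # Every client computes the same delay, so they all come back together.
    return min(cap, 2 ** attempt)

def full_jitter(attempt, cap=60, rng=random):
    # Each client picks a random delay between 0 and the exponential base.
    return rng.uniform(0, min(cap, 2 ** attempt))

rng = random.Random(42)      # seeded only so the sketch is repeatable
clients = 3000
attempt = 5                  # exponential base of 32 seconds

fixed = [fixed_backoff(attempt) for _ in range(clients)]
jittered = [full_jitter(attempt, rng=rng) for _ in range(clients)]

print(peak_per_second(fixed))      # 3000: the whole herd returns in one second
print(peak_per_second(jittered))   # far lower: retries spread over about 32 seconds
```

Fixed backoff only delays the stampede. Jitter is what actually breaks it up.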

Triage first, though: if a runaway client or bad health check is causing the restores, stopping it removes the load at the source — far better than absorbing it.
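One way to start triage is to count restore requests per caller and see whether a handful of sources dominate. The sketch below assumes you can get recent requests as `(client_id, backup_id)` pairs; that record shape, the function name `top_sources`, and the sample data are hypothetical.

```python
from collections import Counter

def top_sources(requests, n=3):
    """Return the n busiest clients as (client_id, count, share_of_total)."""
    counts = Counter(client for client, _backup in requests)
    total = sum(counts.values())
    return [(client, count, count / total)
            for client, count in counts.most_common(n)]

# Made-up sample: one crash-looping worker drowning out two normal ones.
requests = (
    [("worker-7", "base-img")] * 90
    + [("worker-1", "db-a")] * 6
    + [("worker-2", "db-b")] * 4
)

print(top_sources(requests, n=1))   # [('worker-7', 90, 0.9)]
```

If one client or one fleet accounts for most of the surge, stopping it is likely to help more than any amount of queueing.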

Signals

Signal | Reading
Restore rate spikes vertically | Synchronised herd, not organic growth
Backend latency & errors climb together | Saturation, then timeout-driven retries
Retry count > original request count | Retry amplification feeding the storm
Many restores of the same backup id | Single-flight would collapse them
Load drops the instant you cap concurrency | Containment is working
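Two of these signals, retry amplification and repeated restores of the same backup id, can be read straight off a request log. A minimal sketch, assuming each event is a `(backup_id, attempt)` pair where attempt 0 is the original request; the event shape and the name `storm_signals` are assumptions for illustration.

```python
from collections import Counter

def storm_signals(events):
    """Summarise retry amplification and duplication from (backup_id, attempt) events."""
    originals = sum(1 for _bid, attempt in events if attempt == 0)
    retries = len(events) - originals
    per_backup = Counter(bid for bid, _attempt in events)
    hottest_id, hottest_count = per_backup.most_common(1)[0]
    return {
        "retry_amplification": retries > originals,   # retries outnumber originals
        "retry_ratio": retries / originals if originals else float("inf"),
        "hottest_backup": hottest_id,                 # best single-flight candidate
        "hottest_share": hottest_count / len(events),
    }

# Made-up sample: one backup retried three times, another requested once.
events = [("base", 0), ("base", 1), ("base", 2), ("base", 3), ("db", 0)]
print(storm_signals(events))
```

A high `hottest_share` suggests single-flight would remove most of the load; a `retry_ratio` above 1 suggests the retries themselves are now the main traffic.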

Watch out for

Worked example

A bad config push makes 3,000 workers crash-loop; each restart triggers a restore. All 3,000 hit the backend within seconds, latency jumps from 50 ms to 30 s, restores time out and retry, and load pins at 100%. Triage: you spot the crash-loop source and roll back the config — new restores stop. Contain: a semaphore of 50 drains the backlog in waves, full-jitter backoff spreads the retries, and single-flight collapses the many identical restores of the same base image into one. Latency falls back under a second and the queue empties cleanly.
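The drain arithmetic in this example is easy to check. The sketch assumes every wave fills the semaphore and restores take roughly equal time; the figure of 200 distinct backups is invented purely to show the effect of single-flight, since the example above does not say how many there were.

```python
import math

def drain_waves(requests, max_inflight, distinct_backups=None):
    """Number of full-semaphore waves needed to clear the backlog."""
    # With single-flight, only one fetch per distinct backup hits the backend.
    fetches = requests if distinct_backups is None else min(requests, distinct_backups)
    return math.ceil(fetches / max_inflight)

print(drain_waves(3000, 50))                        # 60 waves with no dedup
print(drain_waves(3000, 50, distinct_backups=200))  # 4 waves if only 200 backups are distinct
```

Multiply the wave count by a typical restore time to get a rough drain estimate. The cap sets the pace, and single-flight shortens the queue.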

Check yourself

The backend is saturated by a restore storm. Which move helps most right now?