A restore that stalls

The progress bar climbed, then froze — and the volume sat in read-only limbo while the job still claimed to be running.

The idea

Restoring from a snapshot walks a pipeline: validate the snapshot, allocate the target, rehydrate the blocks, swap pointers, then bring the volume online. While it runs, the volume usually stays read-only so writes don’t fight the restore.
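
The pipeline above can be sketched as an ordered list of stages with the read-only flag held until the last one completes. This is a minimal illustration, not the restore service's actual code: the `Volume` class, the `run_restore` function, and the no-op stage handlers are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field

# Stage names taken from the pipeline described above.
STAGES = ["validate", "allocate", "rehydrate", "swap_pointers", "online"]


@dataclass
class Volume:  # hypothetical stand-in for the volume being restored
    read_only: bool = False
    completed: list = field(default_factory=list)


def run_restore(volume, handlers):
    """Run each stage in order, keeping the volume read-only until all finish."""
    volume.read_only = True
    for stage in STAGES:
        handlers[stage]()  # if a stage raises, the volume stays read-only
        volume.completed.append(stage)
    volume.read_only = False  # writes are allowed only after the final stage
    return volume


# Usage with no-op handlers standing in for the real stage logic.
vol = run_restore(Volume(), {stage: (lambda: None) for stage in STAGES})
```

Note that a handler that never returns leaves `read_only` set to `True` indefinitely, which is exactly the limbo described next.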

The trap is a restore that is running but no longer progressing. If rehydrate gets wedged on a single block — a cold object in a rate-limited store, or a block that isn’t there — a naive retry loop will spin on it forever. The percentage flatlines, the service stays degraded, and the status still cheerfully says “restoring.”

How it works

The bug is in the rehydrate loop: it fetches each block from the object store and retries on failure with no timeout and no ceiling. One cold or missing block — blk-1042 here — makes the loop retry the same fetch endlessly, so the job never advances past it. The fix bounds the retries, times out each fetch, and falls back to a secondary source.

# Buggy: unbounded retry on one block, with no timeout and no skip
for blk in snapshot.blocks:
    while True:                       # spins forever on a wedged block
        data = object_store.get(blk)  # no timeout: can hang or loop on 429s
        if data:
            target.write(blk, data)
            break                     # never reached for blk-1042 → stuck at 62%

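# Fixed: bounded retries, per-op timeout, fallback source
# Assumes MAX_RETRIES and backoff() are defined elsewhere, and that get()
# returns None on a timeout or error rather than raising.
from time import sleep

for blk in snapshot.blocks:
    for attempt in range(MAX_RETRIES):           # ceiling, not forever
        data = object_store.get(blk, timeout=2)  # per-op timeout, in seconds
        if data:
            break
        sleep(backoff(attempt))                  # retry with backoff
    else:                                        # runs only if every attempt failed
        data = secondary_store.get(blk, timeout=2)  # fall back to another source
    if not data:                                 # both sources failed: stop loudly
        raise RuntimeError(f"{blk}: unavailable from primary and secondary")
    target.write(blk, data)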

With a timeout, the wedged fetch fails fast instead of hanging; with a ceiling, the loop stops retrying the same block; with a fallback, the one missing block is pulled from elsewhere. Progress resumes, pointers swap, and the volume comes online.
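
The fixed loop calls `backoff(attempt)` without defining it. One common choice is capped exponential backoff with jitter, sketched below. The parameter names and default values (`base`, `cap`, `jitter`) are assumptions made for illustration, not values taken from the restore service.

```python
import random


def backoff(attempt, base=0.5, cap=30.0, jitter=True):
    """Return the seconds to wait after failed attempt number `attempt` (0-based).

    The delay doubles with each attempt and is clamped to `cap`, so a long
    run of failures never produces an unbounded sleep.
    """
    delay = min(cap, base * (2 ** attempt))
    if jitter:
        # Pick uniformly in [0, delay] so many retrying clients do not
        # hit a rate-limited store in lockstep.
        return random.uniform(0, delay)
    return delay
```

With these defaults and jitter disabled, the delays for attempts 0 through 4 are 0.5, 1, 2, 4, and 8 seconds, so five retries sleep for at most 15.5 seconds in total before the loop gives up on the primary source.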

Cost & signals

Dimension	What to know
User impact	Volume sits read-only the whole time the restore is wedged — degraded, not down, but not recovering
Signal	Progress percentage flatlines — same number for many minutes while status still reads “running”
Signal	Block-fetch throughput drops to ~0 even though the job is “active” — watch the rate, not the status
Signal	The retry counter on a single block climbs without bound while every other block sits idle
Duration	Elapsed restore time blows far past the estimate implied by the snapshot’s size
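
The first two signals can be combined into a simple check: the job is stalled if its status still reads "running" but the completed-block count has not moved for a full window. The sketch below is one way to express that, assuming the job exposes a completed-block counter that can be sampled periodically. `StallDetector`, its 300-second window, and the sample numbers are all hypothetical.

```python
from collections import deque


class StallDetector:
    """Flags a job whose block count has not advanced for `window_s` seconds."""

    def __init__(self, window_s=300.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp_s, blocks_done), oldest first

    def observe(self, now_s, blocks_done):
        self.samples.append((now_s, blocks_done))
        # Drop old samples, but keep one at or before the window start
        # so there is always a baseline to compare against.
        while len(self.samples) >= 2 and self.samples[1][0] <= now_s - self.window_s:
            self.samples.popleft()

    def rate(self):
        """Blocks per second across the retained samples, or None if unknown."""
        if len(self.samples) < 2:
            return None
        (t0, b0), (t1, b1) = self.samples[0], self.samples[-1]
        return (b1 - b0) / (t1 - t0) if t1 > t0 else None

    def is_stalled(self, status):
        """True only if the job claims to run but made no progress all window."""
        if status != "running" or len(self.samples) < 2:
            return False
        (t0, b0), (t1, b1) = self.samples[0], self.samples[-1]
        return (t1 - t0) >= self.window_s and b1 == b0


# Usage: progress stops at block count 1042 after t=60s (made-up samples).
detector = StallDetector(window_s=300)
detector.observe(0, 1000)
detector.observe(60, 1042)
for t in range(120, 361, 60):
    detector.observe(t, 1042)
stalled = detector.is_stalled("running")
```

Comparing block counts rather than the displayed percentage avoids rounding hiding slow progress on a large snapshot; a job that is merely slow still moves the counter inside the window.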

Watch out for

Unbounded retry on a single block. A while True with no ceiling turns one bad block into a permanent stall.
No per-operation timeout. A fetch that can hang or 429-loop indefinitely lets the whole restore wedge behind one request.
No fallback fetch source. If the only source for a block is unreachable, the restore has nowhere else to turn — keep a secondary.
Treating “running” as “progressing.” A job can report active while the progress rate is zero. Alert on the rate, not just the status flag.
Restoring onto the only live copy. Restoring in place with no fallback means a stuck restore takes the service down with it — keep reads served from a replica.

Worked example

A restore from last night’s snapshot kicks off, clears validate and allocate, climbs cleanly to 62% during rehydrate, then sticks. The on-call is paged: the progress percentage has been frozen at 62% for eight minutes, and the block-fetch rate has dropped to roughly 0/s even though the job status still says “running”, which confirms the restore is stuck, not slow. To contain the impact, the team holds new writes and routes reads to a healthy replica so customers stay served rather than fully down. Digging in, they find the rehydrate loop wedged on blk-1042: a cold block in a rate-limited object store, retried forever with no timeout. The fix adds a per-block timeout, bounded retries with backoff, and a secondary source; blk-1042 is fetched from the replica store, the restore resumes past 62%, swaps pointers, and the volume goes online at 100%.

Check yourself

A restore job’s status reads “running” but the percentage hasn’t moved in minutes. What confirms it’s stuck rather than slow?

What is the most direct root-cause fix for a rehydrate loop wedged on one missing block?