A restore that stalls

The progress bar climbed, then froze — and the volume sat in read-only limbo while the job still claimed to be running.

The idea

Restoring from a snapshot walks a pipeline: validate the snapshot, allocate the target, rehydrate the blocks, swap pointers, then bring the volume online. While it runs, the volume usually stays read-only so writes don’t fight the restore.
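The stage order above can be sketched as a small driver. This is only an illustration: `Stage`, `Volume`, `run_restore`, and the `steps` mapping are hypothetical names, not part of any real restore tool, and the read-only flag is modeled as a plain attribute.

```python
from enum import Enum


class Stage(Enum):
    VALIDATE = "validate snapshot"
    ALLOCATE = "allocate target"
    REHYDRATE = "rehydrate blocks"
    SWAP = "swap pointers"
    ONLINE = "bring volume online"


class Volume:
    """Hypothetical volume handle; it only tracks the read-only flag."""

    def __init__(self):
        self.read_only = False


def run_restore(volume, steps):
    """Run every stage in order, keeping the volume read-only until the end.

    `steps` maps each Stage to a zero-argument callable (an assumption of
    this sketch). Returns the list of stages that completed.
    """
    volume.read_only = True      # hold writes while the restore runs
    completed = []
    for stage in Stage:          # an Enum iterates in definition order
        steps[stage]()           # an exception here leaves the volume read-only
        completed.append(stage)
    volume.read_only = False     # reached only after every stage finished
    return completed
```

Note that nothing in this driver notices a stage that never returns: if the rehydrate callable hangs, the volume simply stays read-only, which is the failure the rest of this section is about.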

The trap is a restore that is running but no longer progressing. If rehydrate gets wedged on a single block — a cold object in a rate-limited store, or a block that isn’t there — a naive retry loop will spin on it forever. The percentage flatlines, the service stays degraded, and the status still cheerfully says “restoring.”


How it works

The bug is in the rehydrate loop: it fetches each block from the object store and retries on failure with no timeout and no ceiling. One cold or missing block — blk-1042 here — makes the loop retry the same fetch endlessly, so the job never advances past it. The fix bounds the retries, times out each fetch, and falls back to a secondary source.

# Buggy: unbounded retry on one block — no timeout, no skip
for blk in snapshot.blocks:
    while True:                       # spins forever on a wedged block
        data = object_store.get(blk)  # no timeout — can hang or 429 loop
        if data:
            target.write(blk, data)
            break                     # blk-1042 never returns → stuck at 62%

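# Fixed: bounded retries, per-op timeout, fallback source
# Assumes get() returns None on a miss or timeout; if the client raises instead,
# catch that exception inside the loop and treat it as a failed attempt.
for blk in snapshot.blocks:
    for attempt in range(MAX_RETRIES):              # ceiling, not forever
        data = object_store.get(blk, timeout=2)     # per-op timeout (seconds)
        if data is not None:
            break
        if attempt < MAX_RETRIES - 1:
            sleep(backoff(attempt))                 # back off between attempts only
    else:                                           # runs only if no attempt succeeded
        data = secondary_store.get(blk, timeout=2)  # fall back to another source
        if data is None:
            # RestoreError is a hypothetical exception type: fail loudly
            # rather than write None to the target
            raise RestoreError(f"{blk} missing from both stores")
    target.write(blk, data)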

With a timeout, the wedged fetch fails fast instead of hanging; with a ceiling, the loop stops retrying the same block; with a fallback, the one missing block is pulled from elsewhere. Progress resumes, pointers swap, and the volume comes online.
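The fixed loop calls a `backoff(attempt)` helper that the snippet never defines. One common choice is capped exponential backoff with full jitter; the sketch below assumes that strategy, and the `base` and `cap` defaults are illustrative values rather than anything taken from the original job.

```python
import random


def backoff(attempt, base=0.1, cap=5.0):
    """Return a delay in seconds for a zero-based retry attempt.

    The ceiling grows as base * 2**attempt and is capped at `cap`. The
    delay is drawn uniformly from [0, ceiling] ("full jitter") so that
    many retrying clients do not hit a rate-limited store in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

Jitter matters most for the rate-limited store in this scenario: retries that all fire at the same instants tend to keep tripping the same limit.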

Cost & signals

| Dimension | What to know |
| --- | --- |
| User impact | Volume sits read-only the whole time the restore is wedged: degraded, not down, but not recovering |
| Signal | Progress percentage flatlines: same number for many minutes while status still reads “running” |
| Signal | Block-fetch throughput drops to ~0 even though the job is “active”; watch the rate, not the status |
| Signal | The retry counter on a single block climbs without bound while every other block sits idle |
| Duration | Elapsed restore time blows far past the estimate implied by the snapshot’s size |
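The first two signals can be combined into a simple stall check that looks at progress and throughput instead of trusting the status field. This is a minimal sketch under stated assumptions: `Sample`, `is_stalled`, and the `window_s` and `min_rate` thresholds are hypothetical, and a real monitor would read these values from whatever metrics the restore job actually exports.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float           # seconds since the restore started
    percent: float     # reported progress, 0-100
    blocks_done: int   # cumulative blocks rehydrated
    status: str        # what the job claims, e.g. "running"


def is_stalled(samples, window_s=300.0, min_rate=0.1):
    """Return True if the job claims to run but made no progress over the window.

    `samples` must be in chronological order. `window_s` (seconds) and
    `min_rate` (blocks per second) are illustrative thresholds.
    """
    if not samples or samples[-1].status != "running":
        return False
    latest = samples[-1]

    # Pick the newest sample that is at least window_s older than the latest.
    baseline = None
    for s in samples:
        if latest.t - s.t >= window_s:
            baseline = s
        else:
            break
    if baseline is None:
        return False                      # not enough history to judge

    elapsed = latest.t - baseline.t       # >= window_s, so never zero
    rate = (latest.blocks_done - baseline.blocks_done) / elapsed
    return latest.percent <= baseline.percent and rate < min_rate
```

A slow restore still moves either the percentage or the block count inside the window, so it does not trip this check; a wedged one moves neither.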

Watch out for

Worked example

A restore from last night’s snapshot kicks off, passes validate and allocate cleanly, climbs to 62% during rehydrate, then sticks. The on-call is paged: the progress percentage has been frozen at 62% for eight minutes, and the block-fetch rate has dropped to roughly 0/s even though the job status still says “running”. Together, the flat percentage and the near-zero fetch rate confirm the job is stuck, not slow. To contain the impact, the team holds new writes and routes reads to a healthy replica, so customers are still served rather than fully down. Digging in, they find the rehydrate loop wedged on blk-1042: a cold block in a rate-limited object store, retried forever with no timeout. The fix adds a per-block timeout, bounded retries with backoff, and a secondary source. blk-1042 is fetched from the secondary (replica) store, the restore resumes past 62%, swaps pointers, and the volume goes online at 100%.

Check yourself

A restore job’s status reads “running” but the percentage hasn’t moved in minutes. What confirms it’s stuck rather than slow?

What is the most direct root-cause fix for a rehydrate loop wedged on one missing block?