The progress bar climbed, then froze — and the volume sat in read-only limbo while the job still claimed to be running.
Restoring from a snapshot walks a pipeline: validate the snapshot, allocate the target, rehydrate the blocks, swap pointers, then bring the volume online. While it runs, the volume usually stays read-only so writes don’t fight the restore.
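As a minimal sketch of that pipeline, the stages can be modeled as functions run in order while the volume is held read-only. The `Volume` class, `run_restore`, and the placeholder stages below are hypothetical names chosen for illustration, not a real storage API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Volume:
    """Hypothetical stand-in for a volume; not a real storage API."""
    read_only: bool = False
    online: bool = False
    log: List[str] = field(default_factory=list)


Stage = Callable[[Volume], None]


def run_restore(volume: Volume, stages: List[Stage]) -> None:
    """Run each stage in order, holding the volume read-only until all succeed."""
    volume.read_only = True      # block writes so they don't fight the restore
    for stage in stages:
        stage(volume)            # an exception here leaves the volume read-only
    volume.read_only = False
    volume.online = True         # final step: bring the volume online


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(volume: Volume) -> None:
        volume.log.append(name)
    return stage


STAGES = [make_stage(n) for n in
          ("validate", "allocate", "rehydrate", "swap_pointers")]

volume = Volume()
run_restore(volume, STAGES)
print(volume.log, volume.read_only, volume.online)
```

In this sketch, a stage that raises leaves the volume read-only and offline. A stage that never returns leaves it in the same state with no error at all.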
The trap is a restore that is running but no longer progressing. If rehydrate gets wedged on a single block — a cold object in a rate-limited store, or a block that isn’t there — a naive retry loop will spin on it forever. The percentage flatlines, the service stays degraded, and the status still cheerfully says “restoring.”
The bug is in the rehydrate loop: it fetches each block from the object store and retries on failure with no timeout and no ceiling. One cold or missing block — blk-1042 here — makes the loop retry the same fetch endlessly, so the job never advances past it. The fix bounds the retries, times out each fetch, and falls back to a secondary source.
# Buggy: unbounded retry on one block — no timeout, no skip
for blk in snapshot.blocks:
    while True:                        # spins forever on a wedged block
        data = object_store.get(blk)   # no timeout — can hang or 429 loop
        if data:
            target.write(blk, data)
            break                      # blk-1042 never returns → stuck at 62%
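# Fixed: bounded retries, per-op timeout, fallback source
from time import sleep

MAX_RETRIES = 5  # illustrative ceiling, not a recommended value
for blk in snapshot.blocks:
    data = None
    for attempt in range(MAX_RETRIES):               # ceiling, not forever
        try:
            data = object_store.get(blk, timeout=2)  # per-op timeout
        except TimeoutError:                         # assumes the client raises on timeout
            data = None
        if data:
            break
        sleep(backoff(attempt))                      # backoff() is assumed to return seconds
    else:
        data = secondary_store.get(blk, timeout=2)   # retries exhausted: fall back to another source
    if not data:
        raise RuntimeError(f"{blk}: unavailable from both sources")  # fail loudly, never write empty data
    target.write(blk, data)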
With a timeout, the wedged fetch fails fast instead of hanging; with a ceiling, the loop stops retrying the same block; with a fallback, the one missing block is pulled from elsewhere. Progress resumes, pointers swap, and the volume comes online.
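The three safeguards can also be tried end to end without a real object store. The sketch below is self-contained and rests on several assumptions: `FakeStore` is a hypothetical in-memory stand-in that raises `TimeoutError` for a missing block, `backoff` is one common choice (exponential with full jitter), and the constants are kept tiny so the demo finishes quickly.

```python
import random
import time
from typing import Callable, Dict

MAX_RETRIES = 4      # assumed ceiling on primary attempts
BASE_DELAY = 0.01    # seconds; deliberately tiny for the demo
MAX_DELAY = 0.1      # cap on any single backoff delay


def backoff(attempt: int) -> float:
    """Exponential backoff with full jitter, capped at MAX_DELAY."""
    return random.uniform(0, min(MAX_DELAY, BASE_DELAY * 2 ** attempt))


class FakeStore:
    """Hypothetical in-memory store; a missing block acts like a timed-out fetch."""

    def __init__(self, blocks: Dict[str, bytes]) -> None:
        self.blocks = blocks
        self.calls = 0

    def get(self, blk: str, timeout: float) -> bytes:
        self.calls += 1
        if blk not in self.blocks:
            raise TimeoutError(f"{blk} not returned within {timeout}s")
        return self.blocks[blk]


def fetch_block(blk: str, primary: FakeStore, secondary: FakeStore,
                sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Try the primary up to MAX_RETRIES times, then fall back once."""
    for attempt in range(MAX_RETRIES):
        try:
            return primary.get(blk, timeout=2)
        except TimeoutError:
            if attempt < MAX_RETRIES - 1:
                sleep(backoff(attempt))      # skip the sleep after the last try
    return secondary.get(blk, timeout=2)     # raises if the fallback fails too


primary = FakeStore({"blk-1041": b"a", "blk-1043": b"c"})   # blk-1042 is missing
secondary = FakeStore({"blk-1042": b"b"})
target: Dict[str, bytes] = {}
for blk in ("blk-1041", "blk-1042", "blk-1043"):
    target[blk] = fetch_block(blk, primary, secondary)
print(sorted(target), primary.calls, secondary.calls)
```

If the secondary store also lacks the block, the exception propagates and the restore fails visibly, which is easier to page on than a job that reports “restoring” indefinitely.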
| Dimension | What to know |
|---|---|
| User impact | Volume sits read-only the whole time the restore is wedged — degraded, not down, but not recovering |
| Signal | Progress percentage flatlines — same number for many minutes while status still reads “running” |
| Signal | Block-fetch throughput drops to ~0 even though the job is “active” — watch the rate, not the status |
| Signal | The retry counter on a single block climbs without bound while every other block sits idle |
| Duration | Elapsed restore time blows far past the estimate implied by the snapshot’s size |
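The flatline and throughput signals in the table can be checked mechanically. The sketch below is one possible watchdog, assuming progress is sampled periodically as `(timestamp, blocks_restored)` pairs in ascending time order. The function name, the 300-second window, and the 0.1 blocks/s threshold are illustrative assumptions, not values from any real monitoring system. It never looks at the job's status field, only at the rate.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Sample = Tuple[float, int]   # (timestamp in seconds, blocks restored so far)


@dataclass
class StallVerdict:
    stalled: bool
    rate: Optional[float]    # blocks per second, or None with too little history


def check_stall(samples: List[Sample], window_s: float = 300.0,
                min_rate: float = 0.1) -> StallVerdict:
    """Flag a stall when throughput since a baseline sample is below min_rate."""
    if not samples:
        return StallVerdict(False, None)
    latest_t, latest_n = samples[-1]
    # Baseline: the most recent sample that is at least window_s old.
    baseline: Optional[Sample] = None
    for t, n in samples:
        if latest_t - t >= window_s:
            baseline = (t, n)
        else:
            break
    if baseline is None:
        return StallVerdict(False, None)     # not enough history to judge
    rate = (latest_n - baseline[1]) / (latest_t - baseline[0])
    return StallVerdict(rate < min_rate, rate)


healthy = [(0, 0), (60, 1200), (120, 2400), (180, 3600), (240, 4800), (300, 6000)]
stuck = [(0, 6200), (120, 6200), (240, 6200), (360, 6200), (480, 6200)]
print(check_stall(healthy))
print(check_stall(stuck))
```

Requiring a full window of history before declaring a stall keeps a freshly started restore from being flagged in its first few minutes.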
A while True loop with no ceiling turns one bad block into a permanent stall.
A restore from last night’s snapshot kicks off, climbs cleanly through validate and allocate, then sticks at 62% in rehydrate. The on-call is paged: the progress percentage has been frozen at 62% for eight minutes, and the block-fetch rate has dropped to roughly 0/s even though the job status still says “running”, which confirms the job is stuck, not slow. To contain the impact, the team holds new writes and routes reads to a healthy replica so customers stay served rather than fully down. Digging in, they find the rehydrate loop wedged on blk-1042: a cold block in a rate-limited object store, retried forever with no timeout. The fix adds a per-block timeout, bounded retries with backoff, and a secondary source; blk-1042 is fetched from the replica store, the restore resumes past 62%, swaps pointers, and the volume goes online at 100%.
A restore job’s status reads “running” but the percentage hasn’t moved in minutes. What confirms it’s stuck rather than slow?
What is the most direct root-cause fix for a rehydrate loop wedged on one missing block?