The plan was right — even out the load — but the moves never drained, so the overloaded node stayed overloaded.
When one storage node carries far more load than its peers, a balancer plans shard moves from the hot node onto cold ones so capacity and traffic even out. Each move is a queued job: copy a shard to a target, then drop the old copy.
The failure here is subtle. The plan looks healthy — the queue is full of pending moves — but progress is zero. A move is blocked: the target has no free space, the concurrency throttle is near zero, a failed move is head-of-line blocking the queue, or a constraint leaves no valid placement. So the hot node stays hot and tail latency stays high. Triage means spotting that moves are pending but not progressing, finding why, and unblocking so the queue drains.
A safe balancer never blindly dequeues a move. For each pending move it checks two gates first: does the target have free space (with headroom kept above a floor), and is there a free move-slot under the concurrency limit. A move that fails is retried with backoff or skipped — it must never head-of-line block the rest of the queue.
def drain_queue(queue, nodes, max_concurrent, min_free):
running = 0
for move in list(queue): # don't head-of-line block:
# consider every move, not just the front
if running >= max_concurrent: # throttle == 0 means nothing ever runs
break
tgt = nodes[move.target]
# target-full deadlock: you may need to move data OFF tgt first
if tgt.free < move.size + min_free:
continue # skip for now, revisit next pass
if not placement_valid(move, nodes): # anti-affinity / constraint
continue
try:
running += 1
execute(move) # copy shard, then drop source
queue.remove(move)
except MoveError:
move.retries += 1
if move.retries > RETRY_MAX:
queue.remove(move) # skip; alert; don't block the queue
# else leave queued for a later pass with backoff
| What it costs | While stuck |
|---|---|
| User-facing impact | Hot-node tail latency (p99) stays elevated — the imbalance never lifts |
| Move throughput | Bounded by the concurrency throttle; throttle near zero means moves crawl or stop |
| Durability / availability | Staying imbalanced concentrates risk: one hot node failing loses or stalls more |
| Time to balance | Unbounded while blocked; only the remediation restarts the clock |
| Signal to watch | Pending-moves count not decreasing, move-queue age climbing, one move retrying forever, target free-space hitting its floor |
An eight-node cluster. Node 1 sits at 95% and runs hot while the others idle near 40%. The balancer plans 12 moves off node 1. The first five complete and node 1 starts cooling. Then move 6 stalls: its target, node 4, is itself near full, so there is no room to land the shard.
Because the scheduler retries the head of the queue, moves 7 through 12 never even get a turn — pending sits stuck at 7 while done stays frozen at 5. Triage reads the gauges: pending isn’t falling, queue age is climbing, and node 4 free-space is at the floor. You free space on node 4 (or reorder so a smaller target takes the shard, or raise concurrency). The gate now passes, the queue drains, the remaining moves land, every node settles near 55%, and node 1’s tail latency drops back to baseline.
1. The dashboard shows 7 moves pending but done hasn’t increased in ten minutes. What is the most useful first read?
2. Move 6 keeps failing because its target node is near full, and moves 7–12 never run. What unblocks the queue most directly?