A rebalance that halts with hot nodes still hot

The plan was right — even out the load — but the moves never drained, so the overloaded node stayed overloaded.

The idea

When one storage node carries far more load than its peers, a balancer plans shard moves from the hot node onto cold ones so capacity and traffic even out. Each move is a queued job: copy a shard to a target, then drop the old copy.

The failure here is subtle. The plan looks healthy — the queue is full of pending moves — but progress is zero. A move is blocked: the target has no free space, the concurrency throttle is near zero, a failed move is head-of-line blocking the queue, or a constraint leaves no valid placement. So the hot node stays hot and tail latency stays high. Triage means spotting that moves are pending but not progressing, finding why, and unblocking so the queue drains.

Press Play, or step through the rebalance one move at a time.

How it works

A safe balancer never blindly dequeues a move. For each pending move it checks two gates first: does the target have free space (with headroom kept above a floor), and is there a free move-slot under the concurrency limit. A move that fails is retried with backoff or skipped — it must never head-of-line block the rest of the queue.

def drain_queue(queue, nodes, max_concurrent, min_free):
    running = 0
    for move in list(queue):          # don't head-of-line block:
                                      # consider every move, not just the front
        if running >= max_concurrent: # throttle == 0 means nothing ever runs
            break
        tgt = nodes[move.target]
        # target-full deadlock: you may need to move data OFF tgt first
        if tgt.free < move.size + min_free:
            continue                  # skip for now, revisit next pass
        if not placement_valid(move, nodes):  # anti-affinity / constraint
            continue
        try:
            running += 1
            execute(move)             # copy shard, then drop source
            queue.remove(move)
        except MoveError:
            move.retries += 1
            if move.retries > RETRY_MAX:
                queue.remove(move)    # skip; alert; don't block the queue
            # else leave queued for a later pass with backoff

Cost

What it costsWhile stuck
User-facing impactHot-node tail latency (p99) stays elevated — the imbalance never lifts
Move throughputBounded by the concurrency throttle; throttle near zero means moves crawl or stop
Durability / availabilityStaying imbalanced concentrates risk: one hot node failing loses or stalls more
Time to balanceUnbounded while blocked; only the remediation restarts the clock
Signal to watchPending-moves count not decreasing, move-queue age climbing, one move retrying forever, target free-space hitting its floor

Watch out for

Worked example

An eight-node cluster. Node 1 sits at 95% and runs hot while the others idle near 40%. The balancer plans 12 moves off node 1. The first five complete and node 1 starts cooling. Then move 6 stalls: its target, node 4, is itself near full, so there is no room to land the shard.

Because the scheduler retries the head of the queue, moves 7 through 12 never even get a turn — pending sits stuck at 7 while done stays frozen at 5. Triage reads the gauges: pending isn’t falling, queue age is climbing, and node 4 free-space is at the floor. You free space on node 4 (or reorder so a smaller target takes the shard, or raise concurrency). The gate now passes, the queue drains, the remaining moves land, every node settles near 55%, and node 1’s tail latency drops back to baseline.

Check yourself

1. The dashboard shows 7 moves pending but done hasn’t increased in ten minutes. What is the most useful first read?

2. Move 6 keeps failing because its target node is near full, and moves 7–12 never run. What unblocks the queue most directly?