A rebalance that halts with hot nodes still hot

The plan was right — even out the load — but the moves never drained, so the overloaded node stayed overloaded.

The idea

When one storage node carries far more load than its peers, a balancer plans shard moves from the hot node onto cold ones so capacity and traffic even out. Each move is a queued job: copy a shard to a target, then drop the old copy.

The failure here is subtle. The plan looks healthy — the queue is full of pending moves — but progress is zero. A move is blocked: the target has no free space, the concurrency throttle is near zero, a failed move is head-of-line blocking the queue, or a constraint leaves no valid placement. So the hot node stays hot and tail latency stays high. Triage means spotting that moves are pending but not progressing, finding why, and unblocking so the queue drains.

Press Play, or step through the rebalance one move at a time.

How it works

A safe balancer never blindly dequeues a move. For each pending move it checks two gates first: does the target have free space (with headroom kept above a floor), and is there a free move-slot under the concurrency limit. A move that fails is retried with backoff or skipped — it must never head-of-line block the rest of the queue.

def drain_queue(queue, nodes, max_concurrent, min_free):
    running = 0
    for move in list(queue):          # don't head-of-line block:
                                      # consider every move, not just the front
        if running >= max_concurrent: # throttle == 0 means nothing ever runs
            break
        tgt = nodes[move.target]
        # target-full deadlock: you may need to move data OFF tgt first
        if tgt.free < move.size + min_free:
            continue                  # skip for now, revisit next pass
        if not placement_valid(move, nodes):  # anti-affinity / constraint
            continue
        try:
            running += 1
            execute(move)             # copy shard, then drop source
            queue.remove(move)
        except MoveError:
            move.retries += 1
            if move.retries > RETRY_MAX:
                queue.remove(move)    # skip; alert; don't block the queue
            # else leave queued for a later pass with backoff

Cost

What it costs	While stuck
User-facing impact	Hot-node tail latency (p99) stays elevated — the imbalance never lifts
Move throughput	Bounded by the concurrency throttle; throttle near zero means moves crawl or stop
Durability / availability	Staying imbalanced concentrates risk: one hot node failing loses or stalls more
Time to balance	Unbounded while blocked; only the remediation restarts the clock
Signal to watch	Pending-moves count not decreasing, move-queue age climbing, one move retrying forever, target free-space hitting its floor

Watch out for

Throttle set too low. A move concurrency of one (or zero) means the rebalance never catches up with live write churn — the node stays hot forever. Size the throttle to outpace incoming skew.
The target-full deadlock. You need free space on the target to move data there, but freeing that space needs its own move. Sequence moves, keep a min-free headroom on every node, or evacuate before you fill.
One failing move blocking the queue. If the scheduler retries the same bad move at the head forever, nothing behind it runs. Skip or retry with backoff — never head-of-line block.
Over-tight constraints. Anti-affinity or placement rules so strict that no valid target exists make the planner produce moves that can never execute. Relax the constraint or add capacity.
Rebalancing too aggressively. The opposite failure: moves so fast they steal IO from live traffic. Balancing is a background job — it should yield.
No alert on stuck moves. If pending-moves-not-decreasing pages nobody, the cluster silently stays hot. Alert on queue age, not just queue length.

Worked example

An eight-node cluster. Node 1 sits at 95% and runs hot while the others idle near 40%. The balancer plans 12 moves off node 1. The first five complete and node 1 starts cooling. Then move 6 stalls: its target, node 4, is itself near full, so there is no room to land the shard.

Because the scheduler retries the head of the queue, moves 7 through 12 never even get a turn — pending sits stuck at 7 while done stays frozen at 5. Triage reads the gauges: pending isn’t falling, queue age is climbing, and node 4 free-space is at the floor. You free space on node 4 (or reorder so a smaller target takes the shard, or raise concurrency). The gate now passes, the queue drains, the remaining moves land, every node settles near 55%, and node 1’s tail latency drops back to baseline.

Check yourself

1. The dashboard shows 7 moves pending but done hasn’t increased in ten minutes. What is the most useful first read?

2. Move 6 keeps failing because its target node is near full, and moves 7–12 never run. What unblocks the queue most directly?