When a node goes down, its shards re-home

Keep three copies of everything; when one machine vanishes, quietly make a fresh copy somewhere healthy.

The idea

A distributed storage cluster splits your data into shards and keeps each shard on several nodes at once — a replication factor (RF) of 3 means three copies live on three different machines. If one node fails, every shard it held still survives on its other copies, but each of those shards is now under-replicated: it has fewer copies than it should.

The cluster notices the failure through missed heartbeats (gossip timeouts), then re-homes the lost replicas. For each affected shard it picks a healthy node that doesn't already hold that shard, copies a surviving replica there, and restores the shard back to RF. When every shard is whole again, the cluster is healthy.

Healthy cluster — every shard has 3 copies.

How it works

The cluster keeps a placement map: which nodes hold each shard. A failure detector watches heartbeats; once a node is declared down (after a grace period, so a brief blip doesn't trigger needless work), the cluster walks the placement map and repairs every shard that lost a replica.

def rebalance(placement_map, failed_node, healthy_nodes):
    # Only act after the grace period: a node missing for a few
    # seconds may just be a blip — re-homing on every flap causes churn.
    for shard, replica_nodes in placement_map.items():
        if failed_node not in replica_nodes:
            continue                      # this shard was unaffected

        replica_nodes.remove(failed_node) # now under-replicated

        while len(replica_nodes) < RF:
            target = pick_target(
                healthy_nodes,
                exclude=replica_nodes,    # don't double-place on a node
                exclude_domains=racks_of(replica_nodes),  # spread failure domains
                avoid_overloaded=True,    # balance + rate-limit the rebuild
            )
            source = any_surviving_replica(replica_nodes)
            copy_shard(shard, frm=source, to=target)  # network + disk I/O
            replica_nodes.append(target)
            placement_map[shard] = replica_nodes      # update the map

Cost / signals

What	Cost or signal
Detection delay	Heartbeat / gossip timeout plus a grace period before declaring the node down
Re-home traffic	Roughly the total size of the lost node's data — every under-replicated shard is copied once
Load spike	Source nodes read and send; target nodes receive and write — extra disk and network on both
Time to restore RF	The durability window: how long shards sit under-replicated before they're whole again
Health signal	Under-replicated shard count and rebuild bytes-remaining — both should trend to zero

Watch out for

Re-homing on a transient blip — a node that returns in seconds shouldn't trigger a full rebuild. Use a grace period to avoid churn.
Placing a new replica on a node that already holds the shard — that copy adds no redundancy. The target must not already be in the replica set.
Putting the new copy in the same failure domain (rack, zone) as a surviving one — one rack outage could then take multiple replicas at once.
Letting rebuild traffic saturate the network and starve live reads and writes — rate-limit the copy so it doesn't storm the cluster.
Losing a second node while shards are still under-replicated — the durability window is exactly when a second failure can cause real data loss.

Worked example

Picture a 6-node cluster at RF 3. Node 3 dies holding 200 shards. The instant it's declared down, those 200 shards drop to 2 copies each — all 200 are now under-replicated.

The cluster copies 200 fresh replicas onto the 5 survivors, spreading the targets so no single node takes the whole load (about 40 each), and skipping any node that already holds the shard or sits in the same rack as a surviving copy. As each copy lands, the under-replicated counter ticks down. When it hits zero, every shard is back at RF 3 and the cluster is healthy.

The risk lives in the gap between “node 3 declared down” and “all 200 restored”: during that window a second failure on the wrong node could leave a shard with a single copy — which is why the rebuild is throttled but never lazy.

Check yourself

1. A node misses heartbeats for two seconds, then comes back. Why hold a grace period before re-homing its shards?

2. When picking a target node for a re-homed replica, which node is a poor choice?