Keep three copies of everything; when one machine vanishes, quietly make a fresh copy somewhere healthy.
A distributed storage cluster splits your data into shards and keeps each shard on several nodes at once — a replication factor (RF) of 3 means three copies live on three different machines. If one node fails, every shard it held still survives on its other copies, but each of those shards is now under-replicated: it has fewer copies than it should.
The cluster notices the failure through missed heartbeats (gossip timeouts), then re-homes the lost replicas. For each affected shard it picks a healthy node that doesn't already hold that shard, copies a surviving replica there, and restores the shard back to RF. When every shard is whole again, the cluster is healthy.
The cluster keeps a placement map: which nodes hold each shard. A failure detector watches heartbeats; once a node is declared down (after a grace period, so a brief blip doesn't trigger needless work), the cluster walks the placement map and repairs every shard that lost a replica.
def rebalance(placement_map, failed_node, healthy_nodes):
# Only act after the grace period: a node missing for a few
# seconds may just be a blip — re-homing on every flap causes churn.
for shard, replica_nodes in placement_map.items():
if failed_node not in replica_nodes:
continue # this shard was unaffected
replica_nodes.remove(failed_node) # now under-replicated
while len(replica_nodes) < RF:
target = pick_target(
healthy_nodes,
exclude=replica_nodes, # don't double-place on a node
exclude_domains=racks_of(replica_nodes), # spread failure domains
avoid_overloaded=True, # balance + rate-limit the rebuild
)
source = any_surviving_replica(replica_nodes)
copy_shard(shard, frm=source, to=target) # network + disk I/O
replica_nodes.append(target)
placement_map[shard] = replica_nodes # update the map
| What | Cost or signal |
|---|---|
| Detection delay | Heartbeat / gossip timeout plus a grace period before declaring the node down |
| Re-home traffic | Roughly the total size of the lost node's data — every under-replicated shard is copied once |
| Load spike | Source nodes read and send; target nodes receive and write — extra disk and network on both |
| Time to restore RF | The durability window: how long shards sit under-replicated before they're whole again |
| Health signal | Under-replicated shard count and rebuild bytes-remaining — both should trend to zero |
Picture a 6-node cluster at RF 3. Node 3 dies holding 200 shards. The instant it's declared down, those 200 shards drop to 2 copies each — all 200 are now under-replicated.
The cluster copies 200 fresh replicas onto the 5 survivors, spreading the targets so no single node takes the whole load (about 40 each), and skipping any node that already holds the shard or sits in the same rack as a surviving copy. As each copy lands, the under-replicated counter ticks down. When it hits zero, every shard is back at RF 3 and the cluster is healthy.
The risk lives in the gap between “node 3 declared down” and “all 200 restored”: during that window a second failure on the wrong node could leave a shard with a single copy — which is why the rebuild is throttled but never lazy.
1. A node misses heartbeats for two seconds, then comes back. Why hold a grace period before re-homing its shards?
2. When picking a target node for a re-homed replica, which node is a poor choice?