Read repair across replicas

A read is a chance to notice that a replica fell behind — and to quietly hand it the fresh value on the way out.

The idea

In a leaderless, quorum-replicated store, the same key lives on several replicas at once. Writes don't always reach every one of them — a node was rebooting, a packet dropped — so some replicas keep a stale value while others move on. The data has quietly diverged.

Read repair fixes that on the read path. When a client reads the key, the coordinator asks several replicas, compares their versions, and returns the freshest value to the client. Then comes the repair: it asynchronously writes that fresh value back to whichever replicas were behind. Divergence is healed lazily, and only for the keys that actually get read, instead of waiting for a full background sweep.

Three replicas hold the same key, but one of them missed the last write.

How it works

The coordinator never assumes the replicas agree. It fans the read out, waits for a quorum of replies, and compares their version metadata to decide which value is newest. After it answers the client, it spends a little asynchronous effort writing that newest value back to any replica whose reply was behind. Replicas that did not reply in time are not compared, so this read cannot repair them.

def coordinator_read(key, replicas, R):
    # fan_out_read, wait_for, return_to_client and async_write_back are
    # placeholders for the store's own RPC layer.
    replies = fan_out_read(key, replicas)     # ask several replicas
    quorum = wait_for(replies, R)             # block until R have responded

    winner = max(quorum, key=lambda r: r.version)   # freshest wins
    return_to_client(winner.value)            # answer first, don't block on repair

    # read repair: heal whoever replied with an older version, in the background
    for r in quorum:
        if r.version < winner.version:
            async_write_back(r.node, key, winner.value, winner.version)

Note the ordering: the client gets its answer as soon as the quorum has been compared, and the write-back happens off the critical path. Repair costs the cluster some extra background write work, but it adds no latency to the caller's read.
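To make the answer-first ordering concrete, here is a minimal runnable sketch using asyncio. Everything in it is an assumption for illustration: the in-memory replicas dict, the Reply class, and the helper names read_replica and write_back stand in for a real store's RPC layer. The demo uses R = N = 3 so the stale replica is always among the replies; with R < N this sketch would only repair replicas that happened to answer in time.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Reply:
    node: str
    version: int
    value: str


def make_replicas():
    # Hypothetical in-memory cluster: node -> {key: (version, value)}.
    return {
        "R1": {"color": (3, "blue")},
        "R2": {"color": (1, "red")},  # missed the last write
        "R3": {"color": (3, "blue")},
    }


async def read_replica(replicas, node, key):
    await asyncio.sleep(0)  # stand-in for a network round trip
    version, value = replicas[node][key]
    return Reply(node, version, value)


async def write_back(replicas, node, key, value, version):
    await asyncio.sleep(0)  # stand-in for a network round trip
    current_version, _ = replicas[node].get(key, (0, None))
    if version > current_version:  # never clobber something newer
        replicas[node][key] = (version, value)


async def coordinator_read(replicas, key, r):
    reads = [asyncio.create_task(read_replica(replicas, n, key)) for n in replicas]
    quorum = []
    for finished in asyncio.as_completed(reads):
        quorum.append(await finished)
        if len(quorum) == r:  # stop once R replies are in
            break

    winner = max(quorum, key=lambda reply: reply.version)
    # Schedule repairs but do not await them: the caller gets its answer first.
    repairs = [
        asyncio.create_task(
            write_back(replicas, reply.node, key, winner.value, winner.version)
        )
        for reply in quorum
        if reply.version < winner.version
    ]
    return winner.value, repairs


async def demo():
    replicas = make_replicas()
    # R = N = 3 so the stale replica is guaranteed to be among the replies.
    value, repairs = await coordinator_read(replicas, "color", r=3)
    stale_at_return = replicas["R2"]["color"]  # repair has not run yet
    await asyncio.gather(*repairs)  # only the demo waits; a coordinator would not
    return value, stale_at_return, replicas["R2"]["color"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The demo returns the value handed to the client, R2's copy at the moment the read returned, and R2's copy after the background repair finished, so the gap between "answered" and "healed" is visible.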

Signals

Choice	Buys you	Costs you
Read repair on the read path	Hot keys self-heal lazily, no scheduled sweep	Cold, unread keys stay divergent indefinitely
Pick max version, write back	Stale replicas converge toward the freshest value	Extra write traffic generated by reads
Version / vector-clock metadata	A principled way to say which value is newest	Storage and bookkeeping per key; clock-skew risk if versions are wall-clock timestamps
Pair with anti-entropy / hinted handoff	Hinted handoff replays writes a down node missed; a background sweep catches the keys reads never touch	More moving parts to operate and reason about

Watch out for

Read repair only heals keys that get read. A cold key on a stale replica stays wrong until a read touches it — or anti-entropy sweeps it.
It is not a replacement for anti-entropy. You still need a background reconciliation path for the long tail of unread keys.
Concurrent writes need real conflict resolution. If two writes truly raced, "highest version wins" can silently drop one — vector clocks surface the conflict instead of guessing.
Wall-clock timestamps invite trouble. Clock skew between nodes can make an older write look newer; prefer logical version counters where you can.
Write amplification on reads. Every read that finds divergence emits write-backs — a read-heavy workload over many stale replicas can become surprisingly write-heavy.
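The concurrent-write warning above can be sketched as a vector-clock comparison. This is an illustrative sketch, not any particular store's API: clocks are assumed to be plain {node: counter} dicts, and the names compare and siblings are hypothetical. The point is that two clocks may be unordered, in which case both values are kept for the application to resolve instead of one being silently dropped.

```python
def compare(a, b):
    """Compare two vector clocks given as {node: counter} dicts.

    Returns "before", "after", "equal", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither clock dominates: the writes raced


def siblings(replies):
    """Keep every (clock, value) reply that no other reply supersedes."""
    survivors = []
    for clock, value in replies:
        dominated = any(compare(clock, other) == "before" for other, _ in replies)
        duplicate = any(compare(clock, kept) == "equal" for kept, _ in survivors)
        if not dominated and not duplicate:
            survivors.append((clock, value))
    return survivors
```

With replies carrying clocks {"A": 2} and {"A": 1}, siblings keeps only the first, and the stale replica can be repaired as usual. With clocks {"A": 2} and {"A": 1, "B": 1}, neither dominates, so both values survive as a conflict to resolve.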

Worked example

The key "color" is updated to v3 "blue", but at that moment R2 is down, so it keeps the old v1 "red". R1 and R3 both take the new value. Later a client reads "color". The coordinator fans out, collects R1=v3, R2=v1, R3=v3, and picks the max version, v3 "blue", to return. Noticing that R2 replied with v1, it asynchronously writes v3 "blue" back to R2. R2 catches up, and now all three replicas hold v3 "blue". The very act of reading the key healed it.
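The same trace can be replayed as a few lines of state manipulation. This is a deliberately simplified sketch: the write-back is done synchronously for readability, each replica's copy is assumed to be a (version, value) tuple, and read_with_repair is a hypothetical helper name.

```python
# Replica state right after the v3 write that R2 missed: node -> (version, value)
color = {"R1": (3, "blue"), "R2": (1, "red"), "R3": (3, "blue")}


def read_with_repair(copies):
    """Return the freshest value and the list of nodes that were repaired."""
    version, value = max(copies.values(), key=lambda copy: copy[0])
    behind = [node for node, (v, _) in copies.items() if v < version]
    for node in behind:
        copies[node] = (version, value)  # the write-back
    return value, behind


value, repaired = read_with_repair(color)
```

After the call, value holds the v3 value, repaired names R2 as the only node that needed a write-back, and every entry in color is at v3.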

Check yourself

After this read, R2 is fixed — but a different key that nobody read is still stale on R2. Why?

Coach note: read repair piggybacks on reads, so coverage follows traffic, not the whole keyspace. That's exactly why a background anti-entropy sweep still earns its keep. Take another pass if that pairing feels fuzzy — it's the heart of why both exist.
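For contrast, a background sweep can be sketched as a walk over the whole keyspace, independent of read traffic. The assumptions here are loud ones: replicas are plain {key: (version, value)} dicts, the key "size" is made up for illustration, and the sweep runs in one direction only. Real systems usually compare hash trees (Merkle trees) instead of scanning every key.

```python
def anti_entropy_sweep(source, target):
    """Copy every key where `source` holds a newer version than `target`.

    One-directional full scan; a simplification of real anti-entropy.
    """
    healed = []
    for key, (version, value) in source.items():
        if version > target.get(key, (0, None))[0]:
            target[key] = (version, value)
            healed.append(key)
    return healed


r1 = {"color": (3, "blue"), "size": (2, "large")}
r2 = {"color": (3, "blue"), "size": (1, "small")}  # color was already read-repaired
healed = anti_entropy_sweep(r1, r2)
```

Here "color" needs nothing because a read already repaired it, while "size" is healed only because the sweep visited it.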