Silent replica divergence

Three copies of the same key quietly drift apart while every read still looks happy.

The idea

A replicated store keeps several copies of each key so a node can fail without losing data. The catch is that replication is best-effort: a write that should reach all replicas can miss one because a message was dropped or a network partition flickered. Nothing throws an exception — the missed replica simply keeps its old value.
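
To make the failure mode concrete, here is a minimal sketch of a best-effort write where one replica's message is lost. The `Replica` class, the `best_effort_write` helper, and its `drop` parameter are hypothetical stand-ins for illustration, not part of any real store's API.

```python
class Replica:
    """In-memory stand-in for one storage node (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        self.data[key] = (version, value)


def best_effort_write(replicas, key, version, value, drop):
    """Send the write to every replica.

    `drop` is the set of replica names whose message is lost in transit
    (an assumption used to simulate a flaky network). A lost message
    raises nothing: the replica just never sees the write.
    """
    delivered = []
    for r in replicas:
        if r.name in drop:
            continue  # message lost: no error, no retry
        r.apply(key, version, value)
        delivered.append(r.name)
    return delivered


replicas = [Replica("R1"), Replica("R2"), Replica("R3")]
best_effort_write(replicas, "user:42", 5, "profile-v5", drop=set())
best_effort_write(replicas, "user:42", 6, "profile-v6", drop={"R3"})

versions = {r.name: r.data["user:42"][0] for r in replicas}
print(versions)  # {'R1': 6, 'R2': 6, 'R3': 5}
```

Both calls return normally, yet R3 is already one version behind.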

Over time the replicas diverge: one holds v7, another v6, another v5, and reads from any of them still “succeed.” The store looks healthy on every liveness check, yet a read routed to the stale replica returns old data with no signal. The cure is an active comparison — anti-entropy or read-repair — that diffs checksums across replicas and heals the laggards.

How it works

Each replica stores a value plus a version and a checksum (a short digest of the bytes). They stay in sync only as long as every write lands everywhere. To catch silent drift you don’t trust acknowledgements — you periodically compare digests across replicas, pick the newest version, and copy it back to the stale ones.
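
The stored record can be sketched as follows. The `StoredCopy` type, the `make_copy` helper, and the choice of a truncated SHA-256 digest are assumptions for illustration; a real store may use a different digest and layout.

```python
import hashlib
from dataclasses import dataclass


def digest(value: bytes) -> str:
    """Short digest of the stored bytes (truncated SHA-256, an assumed choice)."""
    return hashlib.sha256(value).hexdigest()[:16]


@dataclass(frozen=True)
class StoredCopy:
    version: int
    value: bytes
    checksum: str


def make_copy(version: int, value: bytes) -> StoredCopy:
    # The checksum is always derived from the bytes, never supplied by the caller.
    return StoredCopy(version, value, digest(value))


fresh_a = make_copy(7, b"name=Ada")
fresh_b = make_copy(7, b"name=Ada")
stale = make_copy(6, b"name=A.")

print(fresh_a.checksum == fresh_b.checksum)  # True
print(fresh_a.checksum == stale.checksum)    # False
```

Comparing the short checksums is enough to tell that two replicas hold different bytes without shipping the values themselves.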

# Anti-entropy / read-repair: compare digests, heal the laggards.
# Assumes each replica exposes read(key) -> record with .version, .checksum, .value
# and write(key, value, version); `metrics` is the application's metrics client.
def read_repair(key, replicas):
    copies = [r.read(key) for r in replicas]        # one record per replica
    newest = max(copies, key=lambda c: c.version)   # newest by version, not arrival

    for r, c in zip(replicas, copies):
        if c.checksum != newest.checksum:           # digests disagree -> divergence
            r.write(key, newest.value, newest.version)
            metrics.read_repairs.inc()              # counter; its rate should trend toward zero
    return newest.value

Because the comparison runs on a digest of the stored bytes, not on a delivery receipt, it notices drift even when every replica reported its earlier write as a success. A repair count that keeps climbing is your early warning that replication is dropping updates.
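
One way to turn the repair counter into a warning is to look at how many repairs each interval needed, rather than at the running total. The sketch below assumes the counter is sampled at regular intervals into a list; `repairs_per_interval`, `is_climbing`, and the `window` default are hypothetical names and values chosen for illustration.

```python
def repairs_per_interval(cumulative):
    """Turn cumulative counter samples into per-interval repair counts."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def is_climbing(cumulative, window=3):
    """True when the per-interval repair count rose in each of the last
    `window` consecutive comparisons. Returns False if there is too little data."""
    deltas = repairs_per_interval(cumulative)
    recent = deltas[-(window + 1):]
    if len(recent) < window + 1:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


healthy = [0, 40, 55, 60, 62, 62]  # per-interval repairs: 40, 15, 5, 2, 0
leaking = [0, 2, 5, 11, 20, 34]    # per-interval repairs: 2, 3, 6, 9, 14

print(is_climbing(healthy))  # False
print(is_climbing(leaking))  # True
```

The healthy series shows a backlog being worked off; the leaking one shows new divergence arriving faster than it is healed.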

Cost & signals

Dimension	What to know
Why it’s silent	A missed write raises no error; the stale replica answers reads just fine
Signal	Checksums for the same key differ across replicas at the same logical time
Signal	Read-repair / anti-entropy repair counts climbing instead of trending to zero
Signal	Version vectors that never converge — replicas stuck at different versions
Fix cost	Low per repair, but the longer drift goes unrepaired, the more reads are served stale
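
The first signal in the table, differing checksums for the same key, can be checked with a sketch like the one below. It assumes each replica can be viewed as a plain dict of key to bytes; `divergent_keys` and the sample data are hypothetical.

```python
import hashlib


def digest(value: bytes) -> str:
    return hashlib.sha256(value).hexdigest()


def divergent_keys(replicas):
    """Report keys whose replicas disagree.

    `replicas` maps replica name -> {key: bytes}. The result maps each
    divergent key to {checksum (or None if the key is missing): [replica names]}.
    """
    all_keys = set()
    for store in replicas.values():
        all_keys.update(store)

    report = {}
    for key in sorted(all_keys):
        groups = {}
        for name, store in replicas.items():
            value = store.get(key)
            checksum = digest(value) if value is not None else None
            groups.setdefault(checksum, []).append(name)
        if len(groups) > 1:  # more than one distinct digest -> divergence
            report[key] = groups
    return report


replicas = {
    "R1": {"user:42": b"v7", "user:43": b"v1"},
    "R2": {"user:42": b"v6", "user:43": b"v1"},
    "R3": {"user:42": b"v5", "user:43": b"v1"},
}
report = divergent_keys(replicas)
print(sorted(report))  # ['user:42']
```

An alert on a non-empty report fires on inconsistency itself, which a heartbeat check never sees.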

Watch out for

No anti-entropy or read-repair. Without an active compare-and-heal pass, drift only grows; nothing pulls replicas back together.
Last-writer-wins with no versioning. Picking “newest” by wall clock or arrival order can resurrect stale data; compare versions or vector clocks.
Ignoring dropped replication acks. A write that only reached two of three replicas is a divergence in waiting — count and retry the misses.
Monitoring liveness, not consistency. Every node can be up and answering while quietly serving different values. Alert on digest mismatch, not just heartbeats.
Reading from one replica only. A single-replica read can’t detect that its peers disagree; quorum reads surface the conflict.
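
For the versioning pitfall above, a vector clock comparison tells you whether one copy really supersedes another or whether the two were written concurrently and need a merge. This is a minimal sketch that represents a clock as a dict of node name to counter; the `compare` function and its return labels are illustrative choices.

```python
def compare(a, b):
    """Compare two vector clocks (dict: node name -> counter).

    Returns 'equal', 'before' (a happened before b), 'after'
    (a happened after b), or 'concurrent' (neither dominates).
    Missing entries count as 0.
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


print(compare({"R1": 2, "R2": 1}, {"R1": 3, "R2": 1}))  # before
print(compare({"R1": 3}, {"R1": 2, "R2": 1}))           # concurrent
```

A "concurrent" result is exactly the case where picking a winner by wall clock or arrival order can silently discard a write.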

Worked example

A session store keeps three replicas of user:42, all at v5. A profile update bumps the key to v6 and reaches R1 and R2, but the message to R3 is dropped during a brief partition. No error surfaces, so the app moves on. A later update reaches only R1, leaving R1=v7, R2=v6, R3=v5. The user’s next request is load-balanced to R3, which returns a profile two versions old, with no log line to explain it. The fix already existed but was disabled: a nightly anti-entropy scrub that compares checksums, finds that R2 and R3 disagree with R1’s v7, and copies v7 back to both. Re-enabling it, and alerting when the repair count rises, turns a silent data bug into a visible, self-healing one.
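
The scenario can be replayed in a few lines. This is a sketch under stated assumptions: replicas are plain dicts, every replica holds the key, and `Copy`, `make_copy`, and `scrub` are hypothetical names rather than the store's real scrub job.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    version: int
    value: bytes
    checksum: str


def make_copy(version, value):
    return Copy(version, value, hashlib.sha256(value).hexdigest())


def scrub(stores, key):
    """Compare checksums for `key` across replicas and overwrite any copy
    that differs from the highest-versioned one. Returns the replicas repaired."""
    copies = {name: store[key] for name, store in stores.items()}
    newest = max(copies.values(), key=lambda c: c.version)
    repaired = []
    for name, copy in copies.items():
        if copy.checksum != newest.checksum:
            stores[name][key] = newest
            repaired.append(name)
    return repaired


# State after the two dropped replication messages.
stores = {
    "R1": {"user:42": make_copy(7, b"profile-v7")},
    "R2": {"user:42": make_copy(6, b"profile-v6")},
    "R3": {"user:42": make_copy(5, b"profile-v5")},
}

first = scrub(stores, "user:42")
second = scrub(stores, "user:42")
print(first)   # ['R2', 'R3']
print(second)  # []
```

The first pass repairs two replicas and the second repairs none, which is the "trend toward zero" you want to see in the repair metric.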

Check yourself

Why can replicas diverge without any error being raised?

What actually detects silent divergence?