Silent replica divergence

Three copies of the same key quietly drift apart while every read still looks happy.

The idea

A replicated store keeps several copies of each key so a node can fail without losing data. The catch is that replication is often best-effort: a write meant for every replica can miss one because a message was dropped or a network partition flickered. Nothing throws an exception; the missed replica simply keeps its old value.
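To make the failure mode concrete, here is a minimal sketch of a best-effort write path. The `Replica` class, the `reachable` flag, and the `required_acks` threshold are all hypothetical, invented for illustration; real stores differ in how they count acknowledgements. The point is only that the write reports success while one replica silently keeps its old value.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica: key -> (version, value)."""
    name: str
    reachable: bool = True
    data: dict = field(default_factory=dict)

    def apply(self, key, version, value):
        # A dropped message is modelled as a no-op: nothing is raised.
        if self.reachable:
            self.data[key] = (version, value)
        return self.reachable


def best_effort_write(replicas, key, version, value, required_acks=2):
    """Fan the write out; report success once enough replicas acknowledge.

    required_acks is an assumed policy, not something the text specifies.
    """
    acks = sum(r.apply(key, version, value) for r in replicas)
    return acks >= required_acks


r1, r2, r3 = Replica("R1"), Replica("R2"), Replica("R3")
best_effort_write([r1, r2, r3], "user:42", 5, "profile-v5")

r3.reachable = False   # a brief partition isolates R3
ok = best_effort_write([r1, r2, r3], "user:42", 6, "profile-v6")
r3.reachable = True    # the partition heals, but R3 never received v6

versions = {r.name: r.data["user:42"][0] for r in (r1, r2, r3)}
print(ok, versions)    # True {'R1': 6, 'R2': 6, 'R3': 5}
```

The caller sees `True` and moves on; only a later comparison of the replicas' contents would reveal that R3 is behind.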

Over time the replicas diverge: one holds v7, another v6, another v5, and reads from any of them still “succeed.” The store looks healthy on every liveness check, yet a read routed to the stale replica returns old data with no signal. The cure is an active comparison — anti-entropy or read-repair — that diffs checksums across replicas and heals the laggards.


How it works

Each replica stores a value plus a version and a checksum (a short digest of the bytes). They stay in sync only as long as every write lands everywhere. To catch silent drift you don’t trust acknowledgements — you periodically compare digests across replicas, pick the newest version, and copy it back to the stale ones.

# Anti-entropy / read-repair: compare digests, heal the laggards
# Assumes r.read(key) returns a record with .version, .checksum and .value,
# and that `metrics` is the application's metrics client.
def read_repair(key, replicas):
    copies = [r.read(key) for r in replicas]        # one record per replica
    newest = max(copies, key=lambda c: c.version)   # newest by version, not arrival

    for r, c in zip(replicas, copies):
        if c.checksum != newest.checksum:           # digests disagree -> divergence
            r.write(key, newest.value, newest.version)
            metrics.read_repairs.inc()              # should trend toward zero
    return newest.value

Because the comparison is on the stored bytes, not on a delivery receipt, it notices drift even when every replica reported its earlier write as a success. A climbing repair count is your early warning that replication is dropping updates.
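The read-repair routine can be exercised end to end with a small in-memory harness. This is a sketch under assumptions: `MemReplica`, `Copy`, `digest`, and the `counters` dict are hypothetical stand-ins for a real replica client and metrics library, and SHA-256 is just one reasonable choice of digest.

```python
import hashlib
from collections import namedtuple

# Record shape assumed by the routine: version, checksum, value.
Copy = namedtuple("Copy", "version checksum value")


def digest(value: bytes) -> str:
    """Short fingerprint of the stored bytes (SHA-256 chosen for illustration)."""
    return hashlib.sha256(value).hexdigest()


class MemReplica:
    """Hypothetical in-memory replica used only for this demonstration."""

    def __init__(self):
        self.store = {}

    def read(self, key):
        return self.store[key]

    def write(self, key, value, version):
        self.store[key] = Copy(version, digest(value), value)


def read_repair(key, replicas, counters):
    copies = [r.read(key) for r in replicas]
    newest = max(copies, key=lambda c: c.version)   # newest by version
    for r, c in zip(replicas, copies):
        if c.checksum != newest.checksum:           # stored bytes differ
            r.write(key, newest.value, newest.version)
            counters["read_repairs"] += 1
    return newest.value


# Three replicas that have already drifted apart.
replicas = [MemReplica() for _ in range(3)]
for r, (ver, val) in zip(replicas, [(7, b"v7"), (6, b"v6"), (5, b"v5")]):
    r.write("user:42", val, ver)

counters = {"read_repairs": 0}
first = read_repair("user:42", replicas, counters)
repairs_after_first = counters["read_repairs"]
second = read_repair("user:42", replicas, counters)
print(first, repairs_after_first, counters["read_repairs"])  # b'v7' 2 2
```

The second call performs no repairs because all three digests now match, which is the "trend toward zero" behaviour a healthy cluster should show.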

Cost & signals

Dimension | What to know
--- | ---
Why it’s silent | A missed write raises no error; the stale replica answers reads just fine
Signal | Checksums for the same key differ across replicas at the same logical time
Signal | Read-repair / anti-entropy repair counts climbing instead of trending to zero
Signal | Version vectors that never converge: replicas stuck at different versions
Fix cost | Low per repair, but unbounded drift means more reads served stale meanwhile
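Two of these signals are easy to turn into checks. The sketch below is illustrative only: the function names, the per-interval `samples` list, and the `window` size are assumptions, and a production alert would normally live in the monitoring system rather than in application code.

```python
def is_divergent(checksums):
    """True when replicas report more than one distinct checksum for a key."""
    return len(set(checksums)) > 1


def repairs_climbing(samples, window=3):
    """True when per-interval repair counts rose strictly across the last
    `window` intervals. `samples` is a list of repairs counted per interval."""
    recent = samples[-window:]
    if len(recent) < window:
        return False   # not enough history to judge a trend
    return all(a < b for a, b in zip(recent, recent[1:]))


print(is_divergent(["a1f3", "a1f3", "9c0e"]))   # True
print(is_divergent(["a1f3", "a1f3", "a1f3"]))   # False
print(repairs_climbing([4, 2, 1, 0]))           # False: healing, trending to zero
print(repairs_climbing([0, 1, 3, 8]))           # True: replication is dropping writes
```

A strict rise over a short window is a deliberately simple rule; a rate threshold or a longer smoothing window may suit noisy workloads better.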

Watch out for

Worked example

A session store keeps three replicas of user:42. A profile update bumps it to v6 and reaches R1 and R2, but the message to R3 is dropped during a brief partition — no error surfaces, so the app moves on. A later update reaches only R1, leaving R1=v7, R2=v6, R3=v5. The user’s next request is load-balanced to R3 and they see a two-versions-old profile, with no log line to explain it. The fix already existed but was disabled: a nightly anti-entropy scrub compares checksums, finds R2 and R3 disagree with R1’s v7, and copies v7 back to both. Re-enabling it — and alerting when the repair count rises — turns a silent data bug into a visible, self-healing one.
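The scrub in this scenario can be sketched as a pass over every key on every replica. The data layout (a dict of replica name to `{key: (version, value)}`), the `scrub` function, and the profile strings are hypothetical; a real scrub would typically exchange digests over the network rather than read full values.

```python
import hashlib


def checksum(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()


def scrub(replicas):
    """Compare every key across replicas and copy the highest version to
    any replica whose stored bytes differ.

    Returns a list of (key, replica_name, old_version, new_version) repairs.
    """
    repairs = []
    keys = set().union(*(store.keys() for store in replicas.values()))
    for key in sorted(keys):
        present = [store[key] for store in replicas.values() if key in store]
        newest = max(present, key=lambda c: c[0])   # highest version wins
        for name, store in replicas.items():
            current = store.get(key)
            if current is None or checksum(current[1]) != checksum(newest[1]):
                old_version = current[0] if current else None
                repairs.append((key, name, old_version, newest[0]))
                store[key] = newest
    return repairs


# State from the example: R1=v7, R2=v6, R3=v5.
replicas = {
    "R1": {"user:42": (7, "profile v7")},
    "R2": {"user:42": (6, "profile v6")},
    "R3": {"user:42": (5, "profile v5")},
}

repairs = scrub(replicas)
print(repairs)          # [('user:42', 'R2', 6, 7), ('user:42', 'R3', 5, 7)]
print(scrub(replicas))  # [] once every replica holds v7
```

Emitting `len(repairs)` as a metric after each run is what makes the divergence visible: a non-empty list on consecutive nights means writes are still being dropped.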

Check yourself

Why can replicas diverge without any error being raised?

What actually detects silent divergence?