Three copies of the same key quietly drift apart while every read still looks happy.
A replicated store keeps several copies of each key so a node can fail without losing data. The catch is that replication is best-effort: a write that should reach all replicas can miss one because a message was dropped or a network partition flickered. Nothing throws an exception — the missed replica simply keeps its old value.
Over time the replicas diverge: one holds v7, another v6, another v5, and reads from any of them still “succeed.” The store looks healthy on every liveness check, yet a read routed to the stale replica returns old data with no signal. The cure is an active comparison — anti-entropy or read-repair — that diffs checksums across replicas and heals the laggards.
Each replica stores a value plus a version and a checksum (a short digest of the bytes). They stay in sync only as long as every write lands everywhere. To catch silent drift you don’t trust acknowledgements — you periodically compare digests across replicas, pick the newest version, and copy it back to the stale ones.
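The record described above might look like the following minimal sketch. The `Copy` layout, the `make_copy` helper, and the choice of SHA-256 as the digest are assumptions for illustration; the text does not name a hash or a storage format.

```python
import hashlib
from typing import NamedTuple


class Copy(NamedTuple):
    """One replica's stored state for a key (hypothetical layout)."""
    version: int
    checksum: str
    value: bytes


def make_copy(value: bytes, version: int) -> Copy:
    # The checksum is derived from the stored bytes alone, so two replicas
    # holding the same value always produce the same digest.
    # SHA-256 is an assumption; any stable digest would serve.
    return Copy(version, hashlib.sha256(value).hexdigest(), value)


fresh = make_copy(b"name=Ada", version=7)
stale = make_copy(b"name=A", version=6)
print(fresh.checksum == stale.checksum)  # False: the digests expose the drift
```

Because the digest depends only on the bytes, comparing two short strings is enough to tell whether two replicas hold the same value, without shipping the values themselves.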
# Anti-entropy / read-repair: compare digests, heal the laggards
def read_repair(key, replicas):
    copies = [r.read(key) for r in replicas]  # each copy has .version, .checksum, .value
    newest = max(copies, key=lambda c: c.version)  # newest by version, not arrival
    for r, c in zip(replicas, copies):
        if c.checksum != newest.checksum:  # digests disagree -> divergence
            r.write(key, newest.value, newest.version)
            metrics.read_repairs.inc()  # repair rate should trend toward zero
    return newest.value
Because the comparison is on the stored bytes, not on a delivery receipt, it notices drift even when every replica reported its earlier write as a success. A climbing repair count is your early warning that replication is dropping updates.
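One way to turn "climbing" into an alert condition is to compare repair counts across consecutive scrub intervals. The sketch below is only an illustration: `RepairTrend`, its window size, and the strictly-increasing rule are all assumptions, not part of any particular metrics library.

```python
from collections import deque


class RepairTrend:
    """Tracks repairs per scrub interval (hypothetical helper)."""

    def __init__(self, window: int = 3):
        # Keep only the most recent `window` interval counts.
        self.counts = deque(maxlen=window)

    def record(self, repairs_this_interval: int) -> None:
        self.counts.append(repairs_this_interval)

    def is_climbing(self) -> bool:
        # Report a climb only when the window is full and every interval
        # saw strictly more repairs than the one before it.
        if len(self.counts) < self.counts.maxlen:
            return False
        history = list(self.counts)
        return all(later > earlier for earlier, later in zip(history, history[1:]))


trend = RepairTrend(window=3)
for n in (2, 5, 9):
    trend.record(n)
print(trend.is_climbing())  # True
```

A healthy system would show counts that fall or stay near zero, so a sustained rise across several intervals is the pattern worth paging on.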
| Dimension | What to know |
|---|---|
| Why it’s silent | A missed write raises no error; the stale replica answers reads just fine |
| Signal | Checksums for the same key differ across replicas at the same logical time |
| Signal | Read-repair / anti-entropy repair counts climbing instead of trending to zero |
| Signal | Version vectors that never converge — replicas stuck at different versions |
| Fix cost | Low per repair, but the longer drift goes unrepaired, the more reads are served stale |
A session store keeps three replicas of user:42. A profile update bumps it to v6 and reaches R1 and R2, but the message to R3 is dropped during a brief partition. No error surfaces, so the app moves on. A later update reaches only R1, leaving R1=v7, R2=v6, R3=v5. The user’s next request is load-balanced to R3, and they see a profile that is two versions old, with no log line to explain it. The fix already existed but was disabled: a nightly anti-entropy scrub that compares checksums would have found that R2 and R3 disagree with R1’s v7 and copied v7 back to both. Re-enabling it, and alerting when the repair count rises, turns a silent data bug into a visible, self-healing one.
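The scenario can be replayed with in-memory stand-ins. This is a sketch under stated assumptions: `Replica`, `Copy`, `make_copy`, and `scrub` are hypothetical names, SHA-256 is an assumed digest, and the dropped messages are simulated by simply not calling `write` on the affected replica.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, NamedTuple


class Copy(NamedTuple):
    version: int
    checksum: str
    value: bytes


def make_copy(value: bytes, version: int) -> Copy:
    # SHA-256 is an assumed digest; the checksum depends only on the bytes.
    return Copy(version, hashlib.sha256(value).hexdigest(), value)


@dataclass
class Replica:
    name: str
    data: Dict[str, Copy] = field(default_factory=dict)

    def write(self, key: str, value: bytes, version: int) -> None:
        self.data[key] = make_copy(value, version)


def scrub(key: str, replicas: List[Replica]) -> int:
    """Copy the newest version to every replica whose digest differs."""
    newest = max((r.data[key] for r in replicas), key=lambda c: c.version)
    repaired = 0
    for r in replicas:
        if r.data[key].checksum != newest.checksum:
            r.write(key, newest.value, newest.version)
            repaired += 1
    return repaired


r1, r2, r3 = Replica("R1"), Replica("R2"), Replica("R3")
for r in (r1, r2, r3):
    r.write("user:42", b"profile-v5", 5)
for r in (r1, r2):                       # the v6 message to R3 is dropped
    r.write("user:42", b"profile-v6", 6)
r1.write("user:42", b"profile-v7", 7)    # v7 reaches only R1

print(scrub("user:42", [r1, r2, r3]))                      # 2
print([r.data["user:42"].version for r in (r1, r2, r3)])   # [7, 7, 7]
```

The scrub repairs two replicas on the first pass; a second pass over the same key would repair none, which is the "trending to zero" behavior a healthy cluster should show.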
Why can replicas diverge without any error being raised?
What actually detects silent divergence?