Code Room
On-callHardoc-g655
Subject Storage replica divergence silentLevel Senior–Staff~40 minCommon in Storage & CDN · Distributed systems interviewsIndustries Technology

Question

A multi-replica key-value store (quorum reads/writes, RF=3) shows no errors and good latency, but a data-quality alarm fires: a background consistency checker found that for ~0.3% of keys, the three replicas hold three different values, and quorum reads are non-deterministically returning whichever two agree (sometimes the wrong pair). Dashboards: read/write error rate is 0; hinted-handoff / anti-entropy repair queue has been disabled for 11 days (someone turned it off during an unrelated incident and never re-enabled it); a node was briefly down and came back two weeks ago; write timestamps on the divergent keys cluster around that node's downtime. How do you triage silent replica divergence on durable data with no error symptoms?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Diagram & narrate the incident
Loading whiteboard…
Run or narrate your approach, then ask the coach.