Question
A multi-replica key-value store (quorum reads/writes, RF=3) shows no errors and good latency, but a data-quality alarm fires: a background consistency checker found that for ~0.3% of keys, the three replicas hold three different values, and quorum reads are non-deterministically returning whichever two agree (sometimes the wrong pair). Dashboards: read/write error rate is 0; hinted-handoff / anti-entropy repair queue has been disabled for 11 days (someone turned it off during an unrelated incident and never re-enabled it); a node was briefly down and came back two weeks ago; write timestamps on the divergent keys cluster around that node's downtime. How do you triage silent replica divergence on durable data with no error symptoms?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.