Question
A 5-node distributed storage cluster (Raft-replicated key-value store) backing your billing ledger experiences a 90-second network partition that splits it 3|2. After the network heals, on-call sees: reads for some keys return different values depending on which node serves them; the cluster reports two nodes claiming to have been leader during the partition window; `applied_index` diverges between the two former groups; a small set of writes accepted during the partition is now in question. Latency is fine; the problem is correctness. Some customers were double-charged. How do you triage and resolve a suspected split-brain / replica divergence on durable data?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.