Question
A managed MySQL cluster (semi/async replication, automated failover) failed over at 02:50 when the primary's host became unreachable. A replica was promoted and writes resumed in ~30s. At 09:00, reconciliation finds a few hundred rows where the *old* primary and the *new* primary disagree: the old primary has writes from 02:48–02:50 that the new primary never received, and a couple of rows were written on *both* with different values. The old host has since rejoined as a replica. Dashboards: replication healthy now, no current lag. How do you triage, prevent the divergence from spreading, and reconcile the data safely?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.