On-callHardoc-g624

Subject Replication failoverLevel Senior–Staff~45 minCommon in Databases & SQL · Concurrency · Reliability & on-call · Distributed systems interviewsIndustries Technology

Question

Your primary DB's AZ had a sudden hardware failure and the managed service auto-failed-over to an async standby in another AZ. Failover completed in ~90s and the app recovered — but now support is escalating: a handful of customers say a payment they completed right before the outage is 'missing,' and your idempotency/dedup table is missing the last few seconds of rows. Replication was asynchronous, and the standby was a couple seconds behind the primary at the moment it died. The promoted node is now serving traffic and accumulating new writes on top of the gap. Triage and decide how to handle the lost-write window.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.