Question
Your primary DB's AZ had a sudden hardware failure and the managed service auto-failed-over to an async standby in another AZ. Failover completed in ~90s and the app recovered — but now support is escalating: a handful of customers say a payment they completed right before the outage is 'missing,' and your idempotency/dedup table is missing the last few seconds of rows. Replication was asynchronous, and the standby was a couple seconds behind the primary at the moment it died. The promoted node is now serving traffic and accumulating new writes on top of the gap. Triage and decide how to handle the lost-write window.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
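One way to scope the lost-write window described above is to reconcile the promoted database against a system of record that did not fail over, such as the payment processor's own ledger. The following is a minimal sketch in Python under stated assumptions, not a definitive procedure. `ProcessorCharge`, `lost_write_window`, and `find_missing_charges` are hypothetical names. The shape of the processor export is assumed. The 5-second padding for clock skew is an arbitrary placeholder that would need tuning against real timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ProcessorCharge:
    # Hypothetical shape of one record exported from the payment processor.
    idempotency_key: str
    amount_cents: int
    created_at: datetime


def lost_write_window(last_replayed_at, primary_failed_at,
                      margin=timedelta(seconds=5)):
    """Bound the window of writes the standby may never have received.

    last_replayed_at: commit time of the last transaction the standby applied.
    primary_failed_at: best estimate of when the old primary stopped accepting writes.
    margin: assumed padding for clock skew between hosts (placeholder value).
    """
    return last_replayed_at - margin, primary_failed_at + margin


def find_missing_charges(processor_charges, local_keys, window):
    """Return charges the processor recorded inside the window that the
    promoted database has no idempotency row for, oldest first."""
    start, end = window
    missing = [
        c for c in processor_charges
        if start <= c.created_at <= end and c.idempotency_key not in local_keys
    ]
    return sorted(missing, key=lambda c: c.created_at)


if __name__ == "__main__":
    # Illustrative values only; real inputs come from replication metadata,
    # the processor export, and the promoted node's dedup table.
    replayed = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    failed = replayed + timedelta(seconds=2)
    charges = [
        ProcessorCharge("key-a", 1999, replayed + timedelta(seconds=1)),
        ProcessorCharge("key-b", 4500, replayed + timedelta(milliseconds=1500)),
    ]
    for charge in find_missing_charges(charges, {"key-b"},
                                       lost_write_window(replayed, failed)):
        print(charge.idempotency_key, charge.amount_cents, charge.created_at)
```

The resulting list is a candidate set for manual review and replay, not proof of loss. A charge can be absent locally for reasons unrelated to the failover. Any replay into the promoted node should reuse the original idempotency key so that client retries arriving later do not double-charge.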