Question
Support escalates that thousands of users' account balances went to 0 about 25 minutes ago. You confirm it in the database: a large swath of rows in the `accounts` table were overwritten with bad values. A migration job deployed 30 minutes ago ran an UPDATE without its intended WHERE clause and touched far more rows than expected; it has since finished. The data is replicated and backed up, you have continuous WAL archiving (PITR available), and writes are still flowing from live traffic. How do you triage, recover the correct data, and minimize further damage and downtime?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
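One way to make the "recover the correct data" step concrete is to restore a point-in-time copy to a separate instance, stopping just before the migration ran, and then diff it against production instead of rolling the live database back. The sketch below is only an illustration under those assumptions: `plan_repair`, `RepairPlan`, and the dictionaries of `account_id -> balance` are hypothetical names, and it assumes the migration wrote a single known bad value (0 here). It restores only rows that still hold the bad value and sets aside rows that live traffic has changed since, because overwriting those blindly would lose legitimate writes.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RepairPlan:
    restore: dict    # account_id -> balance to write back from the snapshot
    review: list     # account_ids modified again after the bad UPDATE
    untouched: list  # account_ids that need no repair


def plan_repair(snapshot, current, bad_value):
    """Classify rows by comparing a pre-incident snapshot with live data.

    snapshot:  balances from the PITR copy taken just before the migration
               (assumption: restored to a side instance, not production)
    current:   balances read from production now
    bad_value: the value the faulty UPDATE wrote (assumption: one known value)
    """
    restore, review, untouched = {}, [], []
    for account_id, now in current.items():
        before = snapshot.get(account_id)
        if before is None or now == before:
            # Created after the snapshot, or identical to it: leave alone.
            untouched.append(account_id)
        elif now == bad_value:
            # Still holds the bad value: safe to restore from the snapshot.
            restore[account_id] = before
        else:
            # Changed by live traffic after the damage: reconcile by hand.
            review.append(account_id)
    return RepairPlan(restore, review, untouched)


if __name__ == "__main__":
    snapshot = {1: Decimal("100"), 2: Decimal("50"), 4: Decimal("75")}
    current = {1: Decimal("0"), 2: Decimal("50"), 4: Decimal("20"), 5: Decimal("10")}
    print(plan_repair(snapshot, current, Decimal("0")))
```

A row that was legitimately drained to the bad value after the incident is indistinguishable from a damaged one in this comparison, so the `restore` set is worth spot-checking against a transaction log or ledger, if one exists, before anything is written back.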