On-call | Hard | oc-g584

Subject: Storage, corruption, pitr
Level: Senior–Staff
~40 min
Common in Storage & CDN interviews
Industries: Technology

Question

Support escalates that thousands of users' account balances went to 0 about 25 minutes ago. You confirm it in the database: a large swath of rows in the `accounts` table was overwritten with bad values. A migration job deployed 30 minutes ago ran an UPDATE without its intended WHERE clause and touched far more rows than expected; it has since finished. The data is replicated and backed up, continuous WAL archiving is in place (so PITR is available), and writes are still flowing from live traffic. How do you triage, recover the correct data, and minimize further damage and downtime?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
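
The recovery step can be sketched as a repair plan rather than a blind restore. The example below is a minimal illustration, not a prescribed procedure. It assumes the PITR restore was done to a separate side instance recovered to just before the migration, and that you can list the accounts written by live traffic since then (for example from an application audit log). The names `plan_repair`, `RepairPlan`, `snapshot`, `current`, and `touched_since` are hypothetical and do not come from any real library.

```python
from __future__ import annotations

from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RepairPlan:
    # account_id -> balance to copy back from the pre-migration snapshot
    restore: dict[int, Decimal]
    # account_ids that need manual reconciliation instead of a blind copy
    review: list[int]


def plan_repair(
    snapshot: dict[int, Decimal],
    current: dict[int, Decimal],
    touched_since: set[int],
) -> RepairPlan:
    """Decide, per account, whether the snapshot value can be copied back.

    snapshot:      balances read from the side instance (assumed to reflect
                   the state just before the bad migration ran)
    current:       balances read from production now
    touched_since: accounts written by live traffic after the snapshot point
                   (assumed to be known, e.g. from an audit log)
    """
    restore: dict[int, Decimal] = {}
    review: list[int] = []
    for account_id, good_balance in snapshot.items():
        live_balance = current.get(account_id)
        if live_balance == good_balance:
            continue  # row is unchanged; nothing to repair
        if live_balance is None or account_id in touched_since:
            # Row was deleted or legitimately written after the snapshot:
            # overwriting it would lose real user activity.
            review.append(account_id)
        else:
            # Only the migration changed this row, so the snapshot is correct.
            restore[account_id] = good_balance
    return RepairPlan(restore=restore, review=sorted(review))


if __name__ == "__main__":
    snapshot = {1: Decimal("100"), 2: Decimal("50"), 3: Decimal("7")}
    current = {1: Decimal("0"), 2: Decimal("50"), 3: Decimal("5")}
    print(plan_repair(snapshot, current, touched_since={3}))
```

Accounts that live traffic has written since the snapshot go to `review` because copying the old balance over them would erase legitimate post-migration writes; those rows need the snapshot value plus a replay of the later transactions. Accounts created after the snapshot do not appear in `snapshot` and are left alone.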


