On-call · Hard · oc-g549

Subject: On call · Level: Senior–Staff · ~45 min · Common in: Reliability & on-call interviews · Industries: Technology

Question

At 09:15 a support escalation reports that ~40,000 accounts that were paying customers yesterday now have `subscription_tier = 'free'`. Billing reconciliation flagged the discrepancy. There is no outage: the app is up and fast. Audit logs show a single batch UPDATE at 08:50 from a one-off data-migration script that an engineer ran to 'fix a tier-naming inconsistency'. Its WHERE clause was broader than intended and overwrote the `subscription_tier` column for far more rows than the few hundred targeted. The script did not back up the affected rows. Stripe still shows these customers as active subscribers. Your primary Postgres has continuous WAL archiving and a base backup from 02:00, so the bad write landed 6 hours 50 minutes after the last base backup. How do you recover the correct data with minimal collateral damage, and how do you prevent this class of incident?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
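One way to keep collateral damage low, offered here as a sketch and not the only valid answer, is to leave production running, restore the 02:00 base backup to a separate instance, replay WAL to a point just before 08:50 (Postgres exposes this as `recovery_target_time`), and then repair only the rows that still differ. The Python below illustrates the diff step under stated assumptions: the two dictionaries stand in for query results from the restored copy and the live primary, the function and class names are hypothetical, and the tier names `pro` and `team` are invented for illustration.

```python
from dataclasses import dataclass

# Value the bad UPDATE wrote, taken from the incident description.
CLOBBERED_TIER = "free"


@dataclass(frozen=True)
class Repair:
    account_id: int
    restore_tier: str


def plan_repairs(before, current, touched_after_incident):
    """Split accounts into safe automatic repairs and manual-review cases.

    before: account_id -> tier read from the restored copy (just before 08:50)
    current: account_id -> tier on the live primary right now
    touched_after_incident: ids with a legitimate tier change after 08:50
    """
    repairs, review = [], []
    for account_id, old_tier in sorted(before.items()):
        live_tier = current.get(account_id)
        if live_tier is None:
            continue  # account no longer exists on the primary; nothing to fix
        if old_tier == CLOBBERED_TIER or live_tier != CLOBBERED_TIER:
            continue  # was already free, or no longer shows the clobbered value
        if account_id in touched_after_incident:
            review.append(account_id)  # may be a real downgrade; check Stripe
        else:
            repairs.append(Repair(account_id, old_tier))
    return repairs, review


# Hypothetical sample data: tier names are assumptions, not from the incident.
before = {1: "pro", 2: "free", 3: "team", 4: "pro", 5: "pro"}
current = {1: "free", 2: "free", 3: "free", 4: "team", 5: "free"}
repairs, review = plan_repairs(before, current, touched_after_incident={5})
print(repairs)
print(review)
```

Restricting the repair to rows that were not free before and are free now avoids overwriting legitimate changes made after 08:50. Accounts with later activity go to a review list so they can be cross-checked against Stripe, which the scenario treats as the source of truth for active subscriptions.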
