On-callHardoc-g338

Subject Migration gone wrongLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Technology

Question

You're migrating a `wallets` table from an old Postgres store to a new sharded store using a dual-write + gradual read-cutover: a router decides per `account_id` whether reads go to old or new based on a 'migrated' flag set when each account's data is copied. At 16:00 you bumped the read-cutover from 10% to 60% of accounts. Within minutes, support reports a slice of users seeing *stale balances* — a deposit made an hour ago is missing on refresh, then reappears. Dashboards: error rate flat, both stores healthy, replication current. How do you triage, stop serving stale data, and reconcile?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.