Code Room
On-call | Medium | oc-g337
Subject: Migration gone wrong | Level: Mid–Senior | ~35 min | Common in: Reliability & on-call interviews | Industries: Technology

Question

A migration remaps the integer codes of the `status` enum on a `shipments` table: code `3` changed meaning from 'in_transit' to 'delivered', and a new code `5` now means 'in_transit', to match a new carrier API. The producer service was deployed with the new mapping at 14:00. By 14:30, customers are receiving 'your package was delivered' emails and SMS for packages still in transit, and a partner webhook is firing 'delivered' events early. Dashboards show a flat error rate and normal throughput. Some downstream consumers were NOT redeployed. How do you triage, stop the wrong notifications, and reconcile state?

What a strong answer looks like

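Stop the bleeding first: mitigate by halting the wrong 'delivered' emails, SMS, and partner webhook events before you chase the cause. Then form hypotheses from real signals; a flat error rate and normal throughput mean the data is well-formed but misinterpreted, so look at what was written and who read it under which mapping. Separate root cause (two meanings of code `3` live at once) from symptom (early notifications), communicate status as you go, and close with what prevents a repeat.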

Diagram & narrate the incident