Question
A migration remaps the `status` enum on a `shipments` table: the integer code `3` changed meaning from 'in_transit' to 'delivered' (and a new code `5` now means 'in_transit') to match a new carrier API. The producer service was deployed with the new mapping at 14:00. By 14:30 customers get 'your package was delivered' emails and SMS for packages still in transit, and a partner webhook is firing 'delivered' events early. Dashboards: error rate flat, throughput normal. Some downstream consumers were NOT redeployed. How do you triage, stop the wrong notifications, and reconcile state?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
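To make the mitigate-then-reconcile idea concrete, here is a minimal sketch of a version-aware decoder that refuses to send a 'delivered' notification when it cannot tell which mapping a row was written under. Only the two mappings and the 14:00 deploy time come from the question. The `ShipmentRow` shape, the `writer_version` column, the calendar date in `CUTOVER`, and both function names are assumptions for illustration, not part of any real schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Mappings from the question; codes other than 3 and 5 are omitted.
OLD_MAPPING = {3: "in_transit"}
NEW_MAPPING = {3: "delivered", 5: "in_transit"}

# 14:00 producer deploy from the question; the calendar date is a placeholder.
CUTOVER = datetime(2024, 1, 1, 14, 0)


@dataclass
class ShipmentRow:
    shipment_id: str
    status_code: int
    written_at: datetime
    # Hypothetical column: "old", "new", or None when the writer is unknown.
    writer_version: Optional[str] = None


def decode_status(row: ShipmentRow) -> Optional[str]:
    """Return the status the writer meant, or None if it cannot be determined."""
    if row.writer_version == "new":
        return NEW_MAPPING.get(row.status_code)
    if row.writer_version == "old":
        return OLD_MAPPING.get(row.status_code)
    # No writer tag: fall back to what the code and timestamp can prove.
    if row.status_code == 5:
        return "in_transit"  # code 5 exists only in the new mapping
    if row.written_at < CUTOVER:
        return OLD_MAPPING.get(row.status_code)
    # Code 3 written after the cutover is ambiguous: an old or a new writer
    # could have produced it, so the caller should hold the row for review.
    return None


def should_send_delivered_notification(row: ShipmentRow) -> bool:
    """Fail closed: notify only when the status is provably 'delivered'."""
    return decode_status(row) == "delivered"
```

Under these assumptions, rows for which `decode_status` returns `None` form the reconciliation queue: they can be checked against the carrier's own record before any notification is sent or retracted.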