Question
A feature-flag rollout enables a new 'enriched location' input path for 20% of traffic to your delivery-ETA model. Shortly after the flag ramps, the ETA model's predictions for that 20% go haywire — wildly large ETAs — while the other 80% are normal. The model serves 200s, no errors, normal latency. Dashboards: requests on the flagged path are passing a 'distance_km' feature that is sometimes negative or in the hundreds of thousands; the new enrichment code computes distance from a lat/lng pair but the flag's code path swaps latitude and longitude (and occasionally passes raw meters where the model expects km). The flag is at 20% and scheduled to ramp to 100% in an hour. How do you triage and respond?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.