Question
You promote a new recommendation model to a 5% canary. Within minutes the canary's error rate climbs to ~12% (control is 0%): some requests return a 500 'invalid score' while others return an empty recommendation list. The canary and control share the same feature store and infra. Dashboards: the failing canary requests all involve users with no purchases in the last 90 days; tracing shows the canary model computes a new feature 'avg_order_value_90d' as total_spend / order_count, and for these users order_count is 0, producing NaN/Inf that propagates into the score. Control (old model) doesn't use this feature. How do you triage and respond?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.