On-call · Hard · oc-g668

Subject: Inference canary NaN feature
Level: Senior–Staff
Time: ~35 min
Common in: ML systems · Reliability & on-call interviews
Industries: Technology

Question

You promote a new recommendation model to a 5% canary. Within minutes the canary's error rate climbs to ~12% while control stays at 0%: some requests return a 500 'invalid score', and others return an empty recommendation list. The canary and control share the same feature store and infrastructure. Dashboards show that every failing canary request involves a user with no purchases in the last 90 days. Tracing shows that the canary model computes a new feature, 'avg_order_value_90d', as total_spend / order_count; for these users order_count is 0, so the division yields NaN/Inf, which propagates into the score. The control (old) model doesn't use this feature. How do you triage and respond?
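To make the failure mode concrete, here is a minimal sketch of how one bad division could surface as both symptoms. Everything beyond `total_spend`, `order_count`, and the 'invalid score' message is an assumption for illustration: the helper `ieee_divide` mimics the float semantics of vectorized runtimes (plain Python would raise `ZeroDivisionError` instead), and the linear scorer, threshold, and two response paths are hypothetical, not the system's actual code.

```python
import math


def ieee_divide(numerator: float, denominator: float) -> float:
    """Divide with IEEE 754 semantics: 0/0 -> NaN, x/0 -> +/-Inf (illustrative helper)."""
    if denominator == 0:
        if numerator == 0 or math.isnan(numerator):
            return math.nan
        return math.copysign(math.inf, numerator)
    return numerator / denominator


def score(avg_order_value_90d: float, weight: float = 0.01, bias: float = 0.2) -> float:
    """Hypothetical linear scorer; any arithmetic on NaN/Inf stays non-finite."""
    return bias + weight * avg_order_value_90d


def serialize_score(value: float) -> float:
    """Hypothetical strict response path: rejects non-finite scores (the 500)."""
    if not math.isfinite(value):
        raise ValueError("invalid score")
    return value


def recommend(scores: dict, threshold: float = 0.5) -> list:
    """Hypothetical filter path: NaN > threshold is always False, so the list is empty."""
    return [item for item, s in scores.items() if s > threshold]


# A user with no purchases in the window: total_spend = 0, order_count = 0.
feature = ieee_divide(0.0, 0)
user_score = score(feature)
print(math.isnan(user_score))                        # the NaN reached the score
print(recommend({"a": user_score, "b": user_score}))  # filter path drops everything
```

Under these assumptions, which symptom a request shows depends only on which code path touches the non-finite score first, which would be consistent with a single root cause behind both.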

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
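The mitigate-then-prevent sequence can be sketched in code. This is one possible shape under stated assumptions, not a definitive implementation: the function names, the 1% error-rate delta, the minimum sample size, and the default feature value are all hypothetical. `should_roll_back` stands in for the "stop the bleeding" decision, `safe_avg_order_value` for the root-cause fix, and `finite_or_fallback` for a last-line guard on the model output.

```python
import math
from dataclasses import dataclass


def should_roll_back(canary_errors: int, canary_requests: int,
                     control_errors: int, control_requests: int,
                     max_delta: float = 0.01, min_requests: int = 200) -> bool:
    """Mitigation gate: roll back when the canary error rate exceeds control by max_delta."""
    if canary_requests < min_requests or control_requests <= 0:
        return False  # too little traffic to judge; keep watching
    canary_rate = canary_errors / canary_requests
    control_rate = control_errors / control_requests
    return canary_rate - control_rate > max_delta


@dataclass(frozen=True)
class AvgOrderValue:
    value: float      # always finite, safe to feed to the model
    has_orders: bool  # lets the model tell "no orders" apart from "cheap orders"


def safe_avg_order_value(total_spend: float, order_count: int,
                         default: float = 0.0) -> AvgOrderValue:
    """Root-cause fix: never divide when there are no orders in the window."""
    if order_count <= 0:
        return AvgOrderValue(default, False)
    return AvgOrderValue(total_spend / order_count, True)


def finite_or_fallback(model_score: float, fallback: float) -> float:
    """Prevention guard: never let a non-finite score leave the service."""
    return model_score if math.isfinite(model_score) else fallback


# Roughly the scenario's numbers: ~12% canary errors versus 0% on control.
print(should_roll_back(120, 1000, 0, 19000))
print(safe_avg_order_value(0.0, 0))
print(safe_avg_order_value(100.0, 4))
```

Whether the default should be 0.0, a population mean, or a dedicated missing-value embedding is a modeling decision that has to match what the model saw in training; the sketch only shows that the choice is made explicitly rather than left to the division.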

