On-call · Hard · oc-g670

Subject: Model serving shadow traffic overload
Level: Senior–Staff
~35 min
Common in ML systems interviews
Industries: Technology

Question

To validate a candidate model, a team turned on shadow traffic: production requests are mirrored to a new model whose responses are discarded. Shortly after, the LIVE production model's p99 latency rises 60% and a few requests start timing out, even though real user QPS hasn't changed. Dashboards: the shadow model runs on the SAME GPU node pool and shares the same feature-store and inference-batching infrastructure as the live model; GPU utilization and feature-store QPS roughly doubled when shadowing turned on; the shadow path has no rate limit and mirrors 100% of traffic. Tracing shows live requests now queue longer in the shared batcher. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
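One plausible "stop the bleeding" step for this scenario is to cut the mirrored load rather than touch the live path: the dashboards say the shadow path has no rate limit and mirrors 100% of traffic into a shared batcher. The sketch below is a minimal illustration of that idea, not the team's actual mirroring code. The `ShadowGate` class, its parameters, and the numbers in the usage lines are all hypothetical; it combines a kill switch, a sampling fraction, and a token-bucket cap on shadow QPS.

```python
import random
import time


class ShadowGate:
    """Hypothetical gate deciding whether a live request is also mirrored.

    Assumptions (not from the incident description): mirroring is decided
    per request in the serving process, and dropping a mirror is always safe
    because shadow responses are discarded anyway.
    """

    def __init__(self, sample_rate, max_shadow_qps, enabled=True,
                 clock=time.monotonic, rng=random.random):
        self.sample_rate = sample_rate        # fraction of live requests eligible for mirroring
        self.max_shadow_qps = max_shadow_qps  # hard ceiling on mirrored requests per second
        self.enabled = enabled                # kill switch: set to False to stop shadowing at once
        self._clock = clock
        self._rng = rng
        self._tokens = float(max_shadow_qps)  # token bucket starts full
        self._last = clock()

    def _refill(self):
        # Add tokens in proportion to elapsed time, capped at one second's worth.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(float(self.max_shadow_qps),
                           self._tokens + elapsed * self.max_shadow_qps)

    def should_mirror(self):
        if not self.enabled:
            return False
        if self._rng() >= self.sample_rate:
            return False  # not sampled
        self._refill()
        if self._tokens < 1.0:
            return False  # over the shadow QPS cap: shed the mirror, never the live request
        self._tokens -= 1.0
        return True


# Illustrative usage: mirror at most 5% of traffic and never more than 50 shadow QPS.
gate = ShadowGate(sample_rate=0.05, max_shadow_qps=50)
mirrored = sum(gate.should_mirror() for _ in range(1000))
```

The design choice here is that the gate only ever drops shadow work, so tightening it cannot hurt live requests. A gate like this limits how much load reaches the shared GPU pool, feature store, and batcher; it does not isolate them, so moving the shadow model onto separate capacity would still be a reasonable follow-up to discuss as prevention.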


Diagram & narrate the incident
