Question
To validate a candidate model, a team turned on shadow traffic: production requests are mirrored to a new model whose responses are discarded. Shortly after, the LIVE production model's p99 latency rose 60% and a few requests started timing out, even though real user QPS hadn't changed. Dashboards show that the shadow model runs on the SAME GPU node pool and shares the same feature-store and inference-batching infrastructure as the live model, that GPU utilization and feature-store QPS roughly doubled when shadowing was turned on, and that the shadow path has no rate limit and mirrors 100% of traffic. Tracing shows live requests now queue longer in the shared batcher. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
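For the mitigation step, the scenario names two gaps in the shadow path: it has no rate limit and it mirrors 100% of traffic. A minimal sketch of a gate that could sit in front of the mirroring call is shown below. It combines a kill switch, a sampling fraction, and a token-bucket cap on shadow QPS. The class name `ShadowGate` and all of its parameters are hypothetical, invented for illustration; they are not taken from the team's system or from any real serving framework.

```python
import random
import time


class ShadowGate:
    """Decides whether one mirrored (shadow) request may be sent.

    Hypothetical helper for illustration only. The clock and rng are
    injectable so the behavior can be tested deterministically.
    """

    def __init__(self, sample_rate, max_qps, burst,
                 clock=time.monotonic, rng=random.random):
        self.enabled = True             # kill switch: False stops all shadowing
        self.sample_rate = sample_rate  # fraction of live traffic to mirror (0.0 to 1.0)
        self.max_qps = max_qps          # token refill rate, shadow requests per second
        self.burst = burst              # token bucket capacity
        self._tokens = float(burst)
        self._clock = clock
        self._rng = rng
        self._last = clock()

    def allow(self):
        """Return True if this request may be mirrored to the shadow model."""
        if not self.enabled:
            return False
        # Sampling: mirror only a fraction of traffic instead of 100%.
        if self._rng() >= self.sample_rate:
            return False
        # Token bucket: refill by elapsed time, capped at the burst size.
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.burst, self._tokens + elapsed * self.max_qps)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

During the incident, setting `enabled = False` is the fastest way to stop the bleeding, since shadow responses are discarded and dropping them costs users nothing. Once live p99 recovers, shadowing could be re-enabled at a small `sample_rate` with a `max_qps` cap. Note that this only limits the load the shadow path adds; it does not address the shared GPU pool, feature store, or batcher, so isolating those resources or giving live requests priority in the batcher would still be part of preventing a repeat.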