On-call · Medium · oc-g563

Subject: Model serving | Level: Mid–Senior | ~30 min | Common in ML systems · Reliability & on-call interviews | Industries: Technology

Question

You're paged on the recommendations serving tier. At 14:05 the p99 latency on POST /v1/recommend jumped from 90ms to 2.1s; p50 is still flat at 40ms. Error rate is normal, throughput unchanged. Dashboards: the model server's GPU utilization is pinned near 100% (was ~55%), the dynamic-batching queue-wait histogram has a fat tail, and inflight requests are up 4x. A deploy 30 minutes ago bumped the served model from a distilled checkpoint to the full-size one to 'improve quality.' How do you triage and mitigate?

What a strong answer looks like

Mitigate first to stop the bleeding, then form hypotheses from the signals you actually have rather than from guesses. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
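One way to reason from the signals in the question is a back-of-the-envelope queueing check. The sketch below is illustrative only: it assumes a single-server M/M/1 queue, which a dynamically batched GPU server only roughly resembles, and the 30 ms service time and the inflight and throughput figures are hypothetical numbers, not values from the incident. It shows two things: Little's Law (inflight = throughput × mean latency) implies that 4x inflight at unchanged throughput means roughly 4x mean latency, and queue wait grows sharply as utilization approaches 100%.

```python
def littles_law_mean_latency(inflight: float, throughput_rps: float) -> float:
    """Little's Law: L = lambda * W, so mean time in system W = L / lambda."""
    if throughput_rps <= 0:
        raise ValueError("throughput must be positive")
    return inflight / throughput_rps


def mm1_mean_queue_wait(service_time_s: float, utilization: float) -> float:
    """Mean queue wait for an M/M/1 server: Wq = rho / (1 - rho) * S.

    This is a simplifying assumption; a real batching server behaves differently.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1); at 1.0 the queue grows without bound")
    return utilization / (1.0 - utilization) * service_time_s


# Hypothetical numbers for illustration only.
THROUGHPUT_RPS = 100.0    # unchanged across the incident
INFLIGHT_BEFORE = 5.0
INFLIGHT_DURING = 20.0    # "inflight requests are up 4x"

before = littles_law_mean_latency(INFLIGHT_BEFORE, THROUGHPUT_RPS)
during = littles_law_mean_latency(INFLIGHT_DURING, THROUGHPUT_RPS)
print(f"mean latency before: {before * 1000:.0f} ms, during: {during * 1000:.0f} ms")

SERVICE_TIME_S = 0.030    # hypothetical per-request service time
for rho in (0.55, 0.90, 0.98):
    wait = mm1_mean_queue_wait(SERVICE_TIME_S, rho)
    print(f"utilization {rho:.2f}: mean queue wait {wait * 1000:.0f} ms")
```

Under these assumptions, moving from about 55% utilization toward saturation multiplies queue wait many times over even though each request's compute cost changed far less. That is consistent with a fat queue-wait tail and points at capacity rather than a code bug, which supports mitigating by rolling back or adding capacity before digging further.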

