Code Room
On-callMedium
Question
You're paged on the recommendations serving tier. At 14:05 the p99 latency on POST /v1/recommend jumped from 90ms to 2.1s; p50 is still flat at 40ms. Error rate is normal, throughput unchanged. Dashboards: the model server's GPU utilization is pinned near 100% (was ~55%), the dynamic-batching queue-wait histogram has a fat tail, and inflight requests are up 4x. A deploy 30 minutes ago bumped the served model from a distilled checkpoint to the full-size one to 'improve quality.' How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.