On-call · Medium · oc-g575

Subject: Model serving cold start · Level: Mid–Senior · ~30 min · Common in ML systems interviews · Industries: Technology

Question

During a routine rolling deploy of your translation inference service, the error rate spikes to 8% for ~6 minutes each time a batch of pods cycles, then recovers. The symptom repeats on every rollout. Dashboards show that newly started pods report Ready and receive traffic immediately, but their first ~90 seconds have very high latency and timeouts. The model artifact is a multi-GB file pulled from object storage at startup, and the first inference triggers lazy GPU kernel and weight loading. p99 latency on warm pods is fine. Triage and fix.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
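
One possible shape of the "prevents a repeat" step is to gate readiness on warm-up, so a pod only reports Ready after the model has loaded and at least one inference has run. The sketch below is illustrative only: `WarmupGate`, `fake_load_model`, and the sample warm-up input are hypothetical names, and the fake loader uses short sleeps in place of the real artifact download and lazy GPU initialization. It assumes the orchestrator's readiness probe calls something that returns `readiness_status()`.

```python
import threading
import time
from typing import Callable, Optional, Sequence


class WarmupGate:
    """Reports ready only after the model is loaded AND warmed up (hypothetical helper)."""

    def __init__(self, load_model: Callable[[], Callable[[str], str]],
                 warmup_inputs: Sequence[str]) -> None:
        self._load_model = load_model
        self._warmup_inputs = list(warmup_inputs)
        self._ready = threading.Event()
        self._model: Optional[Callable[[str], str]] = None
        self.error: Optional[BaseException] = None

    def start(self) -> threading.Thread:
        """Load and warm up in the background so the process can answer probes meanwhile."""
        thread = threading.Thread(target=self._run, daemon=True)
        thread.start()
        return thread

    def _run(self) -> None:
        try:
            model = self._load_model()       # e.g. pull the artifact, build the model
            for text in self._warmup_inputs:  # pay the lazy first-inference cost here
                model(text)
            self._model = model
            self._ready.set()
        except Exception as exc:              # stay not-ready and keep the cause for logs
            self.error = exc

    def readiness_status(self) -> int:
        """HTTP-style status a readiness endpoint could return: 200 if warm, else 503."""
        return 200 if self._ready.is_set() else 503

    def translate(self, text: str) -> str:
        if not self._ready.is_set() or self._model is None:
            raise RuntimeError("model not warmed up yet")
        return self._model(text)


def fake_load_model() -> Callable[[str], str]:
    """Stand-in loader (assumption): sleeps simulate download and lazy GPU init."""
    time.sleep(0.05)                          # simulated artifact pull
    state = {"kernels_loaded": False}

    def translate(text: str) -> str:
        if not state["kernels_loaded"]:
            time.sleep(0.05)                  # simulated slow first inference
            state["kernels_loaded"] = True
        return text.upper()                   # placeholder for real translation

    return translate


if __name__ == "__main__":
    gate = WarmupGate(fake_load_model, ["warm-up sentence"])
    print(gate.readiness_status())            # not ready before warm-up starts
    gate.start().join(timeout=5)
    print(gate.readiness_status())            # ready once warm-up has completed
```

The design choice here is that the slow first inference is paid by a synthetic request before the pod is added to the load balancer, rather than by a user request. In a real deployment this would be paired with a probe or startup timeout long enough to cover the artifact download.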
