Code Room
On-callMedium
Question
During a routine rolling deploy of your translation inference service, the error rate spikes to 8% for ~6 minutes each time a batch of pods cycles, then recovers. Symptom repeats on every rollout. Dashboards: newly started pods report Ready and receive traffic immediately, but their first ~90 seconds show very high latency and timeouts; the model artifact is a multi-GB file pulled from object storage at startup, and the first inference triggers lazy GPU kernel/weight loading. p99 on warm pods is fine. Triage and fix.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.