On-callMediumoc-g472

Subject Bad rolloutLevel Mid–Senior~35 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

A JIT-compiled (JVM) recommendation service rolls out. The readiness probe is an HTTP check that passes as soon as the app context loads. During the rollout, each newly-ready pod immediately receives its full share of production traffic and spends its first ~45 seconds at p99 = 6s (JIT warmup + cold caches + connection-pool fill) before settling to 120ms. With 30 pods rolling 3 at a time, these warmup windows stack into a sustained ~8-minute latency bump that breaches the p99 SLO. Dashboards: per-new-pod, the first ~45s show high latency then flat; error rate is near zero; CPU on fresh pods spikes during warmup. Request rate is normal. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.