Question
A JIT-compiled (JVM) recommendation service is rolling out. The readiness probe is an HTTP check that passes as soon as the app context loads. During the rollout, each newly ready pod immediately receives its full share of production traffic and spends its first ~45 seconds at p99 = 6s (JIT warmup + cold caches + connection-pool fill) before settling to 120ms. With 30 pods rolling 3 at a time, these warmup windows stack into a sustained ~8-minute latency bump that breaches the p99 SLO. Dashboards: on each new pod, latency is high for the first ~45s, then flat; error rate is near zero; CPU on fresh pods spikes during warmup; request rate is normal. Triage and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
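As a sanity check on the ~8-minute figure in the scenario, the stacking can be worked out with a few lines of arithmetic. This is a minimal sketch, not a model of the real rollout controller: the function name `rollout_warmup_window` is hypothetical, and it assumes batches run strictly back to back with no overlap and no scheduling or startup gap between them.

```python
import math
from typing import Tuple


def rollout_warmup_window(total_pods: int, batch_size: int, warmup_s: float) -> Tuple[int, float]:
    """Return (number of batches, total seconds during which some batch is cold).

    Assumption (not from the scenario): each batch starts warming only after
    the previous batch has finished, so the warmup windows sit end to end.
    """
    batches = math.ceil(total_pods / batch_size)
    return batches, batches * warmup_s


# Numbers taken from the scenario: 30 pods, 3 at a time, ~45 s warmup each.
batches, cold_seconds = rollout_warmup_window(total_pods=30, batch_size=3, warmup_s=45)
print(batches, cold_seconds / 60)  # prints "10 7.5"
```

Ten batches of 45 seconds give 7.5 minutes of continuous cold-pod exposure. Any pod scheduling or startup time between batches would stretch the wall-clock window, which is consistent with the roughly 8 minutes observed.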