Code Room
On-call · Medium
Question
Every time your JVM-based recommendation service rolls out (rolling deploy, 20 pods, ~2 min apart), fleet-wide p99 latency spikes to 2-3x its steady-state value for the first 60-90 seconds after each pod starts, causing a brief error-budget burn. Steady-state latency is fine. Newly started pods show very high CPU and slow responses as soon as they begin taking traffic, tapering off after ~90s. Heap is fine, there is no GC storm, and no dependency has changed. Health checks pass as soon as the port is open. How do you triage the issue and stop the deploy-time tail-latency spikes?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
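As an illustration of the "mitigate" step, here is a minimal sketch of a readiness gate. It assumes the spike comes from cold-start work on the hot path (for example JIT compilation or lazy initialization), which the question suggests but does not confirm. The class name `WarmupGate`, its methods, and the idea of replaying synthetic requests are hypothetical and not taken from the service described above; the readiness endpoint would return `isReady()` instead of passing as soon as the port is open.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

/**
 * Hypothetical readiness gate: the pod reports ready only after synthetic
 * warm-up requests have exercised the hot path, not when the port opens.
 */
final class WarmupGate {
    private final AtomicBoolean ready = new AtomicBoolean(false);
    private final int warmupIterations;

    WarmupGate(int warmupIterations) {
        if (warmupIterations < 0) {
            throw new IllegalArgumentException("warmupIterations must be >= 0");
        }
        this.warmupIterations = warmupIterations;
    }

    /** Runs the hot path with synthetic request ids, then marks the pod ready. */
    void warmUp(IntConsumer hotPath) {
        for (int i = 0; i < warmupIterations; i++) {
            hotPath.accept(i); // assumed to hit the same code path as real traffic
        }
        ready.set(true);
    }

    /** Value a readiness endpoint would expose to the orchestrator. */
    boolean isReady() {
        return ready.get();
    }
}
```

The iteration count is a tuning knob: it would need to be large enough for per-pod latency to reach steady state before traffic arrives, which is something to confirm from per-pod latency and CPU graphs rather than assume.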