On-call · Medium · oc-g245

Subject: Tail latency
Level: Mid–Senior
Time: ~30 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

Every time your JVM-based recommendation service rolls out (rolling deploy, 20 pods, ~2 min apart), the fleet-wide p99 spikes to 2-3x its steady-state value for the first 60-90 seconds after each pod starts, causing brief error-budget burn, even though steady-state latency is fine. Each newly started pod shows very high CPU and slow responses as soon as it starts taking traffic, tapering off after ~90s. Heap is fine, there is no GC storm, and no dependency has changed. Health checks pass as soon as the port is open. How do you triage and stop the deploy-time tail-latency spike?
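It can help to check why a single slow pod out of 20 is enough to move the fleet-wide p99. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes the load balancer spreads requests evenly across pods, and the 40% figure for a cold pod is a made-up illustrative number. `FleetTail` and `slowFraction` are hypothetical names.

```java
// Sketch: share of fleet requests slower than a latency threshold while
// some pods are still cold. Assumes even load balancing across pods
// (an assumption, not something the question states).
final class FleetTail {

    /**
     * @param pods     total pods serving traffic
     * @param coldPods pods still warming up
     * @param coldSlow fraction of a cold pod's requests slower than the threshold
     * @param warmSlow fraction of a warm pod's requests slower than the threshold
     * @return fraction of all fleet requests slower than the threshold
     */
    static double slowFraction(int pods, int coldPods, double coldSlow, double warmSlow) {
        if (pods <= 0 || coldPods < 0 || coldPods > pods) {
            throw new IllegalArgumentException("need 0 <= coldPods <= pods and pods > 0");
        }
        double coldShare = (double) coldPods / pods; // share of traffic hitting cold pods
        return coldShare * coldSlow + (1.0 - coldShare) * warmSlow;
    }

    public static void main(String[] args) {
        // 20 pods, one cold. Threshold = steady-state p99, so warm pods exceed it
        // 1% of the time. Hypothetical: 40% of the cold pod's requests exceed it.
        double fraction = slowFraction(20, 1, 0.40, 0.01);
        // Roughly 0.0295: about 3% of requests are over the old p99,
        // so the fleet-wide p99 must sit above that threshold.
        System.out.println(fraction);
    }
}
```

Under these assumptions one cold pod carries 5% of traffic, which is already larger than the 1% tail that p99 measures, so the fleet p99 is largely set by the cold pod for as long as it stays slow.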

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
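For the "what prevents a repeat" part, one common option is to keep a pod out of rotation until it has exercised its hot path. This applies if the signals point to JVM warm-up (JIT compilation, class loading, cold caches or connection pools), which is a hypothesis here, not a confirmed root cause. The sketch below shows a readiness gate that only reports ready after warm-up. `WarmupGate`, the synthetic request, and the iteration count are hypothetical. A real service would expose `readinessStatus()` through its readiness endpoint.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of a warm-up gate: readiness stays "not ready" until synthetic
// requests have been pushed through the hot path. Names are illustrative.
final class WarmupGate {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    /**
     * Runs a synthetic request repeatedly so the hot code paths are loaded
     * and compiled before real traffic arrives, then marks the pod ready.
     */
    void warmUp(Runnable syntheticRequest, int iterations) {
        for (int i = 0; i < iterations; i++) {
            try {
                syntheticRequest.run();
            } catch (RuntimeException e) {
                // Warm-up is best effort: a failed synthetic call should not
                // keep the pod out of rotation forever.
            }
        }
        ready.set(true);
    }

    /** HTTP status the readiness endpoint should return: 503 until warm. */
    int readinessStatus() {
        return ready.get() ? 200 : 503;
    }
}
```

A startup hook might call `gate.warmUp(() -> handler.recommend(sampleRequest), 5_000)` before the readiness endpoint returns 200, where `handler` and `sampleRequest` stand in for the service's own request path. The iteration count needs tuning against the observed ~90s taper. As a faster mitigation during an incident, slowing the rollout or ramping traffic to new pods gradually reduces the burn without a code change.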

