On-callMediumoc-g103

Subject Cold startLevel Mid–Senior~35 minCommon in Reliability & on-call interviewsIndustries Technology

Question

During a rolling deploy of your JVM service, every freshly deployed instance shows a 60–90 second window of high p99 latency and elevated errors right after it starts taking traffic, then settles to normal. The load balancer adds each new instance to rotation as soon as its `/health` returns 200, which happens within 2 seconds of process start. Heap and GC look fine after warmup; the spike is only at the start of each instance's life. Caches are empty and the JIT hasn't compiled hot paths yet. How do you triage and stop deploys from causing latency spikes?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.