On-call · Hard · oc-g312

Subject: Cold start · Level: Senior–Staff · ~28 min · Common in: Reliability & on-call interviews · Industries: Technology

Question

You ship a routine deploy of your JVM service at evening peak using a rolling update with maxSurge=50%, maxUnavailable=0. Within a minute p99 spikes from 120ms to 5s, error rate hits 8%, and upstream circuit breakers trip. CPU on the newly rolled pods is pinned at 100% while old pods are fine. The deploy keeps progressing batch by batch, and each new batch produces a fresh latency spike — a rolling wave. Cold pods take ~90s to JIT-compile hot paths and warm connection pools before they perform normally; meanwhile load balancers send them a full share of peak traffic immediately. Last week's off-peak deploy was clean. How do you triage and stabilize?
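To size the problem, it can help to estimate what share of traffic lands on cold pods once the first surge batch turns ready. The sketch below is a back-of-the-envelope calculation under stated assumptions: a hypothetical fleet of 20 replicas, a load balancer that splits requests evenly across ready pods, and Kubernetes' documented rounding (a maxSurge percentage rounds up). Only the 50% surge and maxUnavailable=0 come from the scenario.

```python
import math

def cold_traffic_share(replicas: int, max_surge_pct: float) -> float:
    """Fraction of requests hitting cold pods right after the first surge
    batch becomes ready, assuming the load balancer splits traffic evenly
    and no old pod has been terminated yet (maxUnavailable=0)."""
    surge_pods = math.ceil(replicas * max_surge_pct / 100)  # maxSurge rounds up
    return surge_pods / (replicas + surge_pods)

# Hypothetical fleet of 20 replicas with maxSurge=50% (replica count is assumed)
share = cold_traffic_share(20, 50)
print(f"{share:.0%} of peak traffic goes to cold pods")
```

Under these assumptions roughly a third of peak traffic hits pods that have not yet JIT-compiled their hot paths, which is consistent with the error rate and the tripped circuit breakers described above.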

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
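One way to reason about the "prevents a repeat" part is a slow-start ramp, where a new pod's load-balancer weight grows over the warm-up window instead of jumping to a full share. The following is a minimal sketch, not any specific load balancer's API: the function name, the linear shape, and the 10% floor are illustrative assumptions. Only the ~90s warm-up figure comes from the scenario.

```python
def slow_start_weight(age_s: float, warmup_s: float = 90.0,
                      floor: float = 0.1) -> float:
    """Relative weight (0..1] for a pod that became ready `age_s` seconds ago.

    Linear ramp from `floor` to 1.0 over `warmup_s`. The floor keeps a
    trickle of real traffic flowing so the JIT and connection pools
    actually warm up (values here are assumptions, not tuned numbers).
    """
    if age_s >= warmup_s:
        return 1.0
    progress = max(age_s, 0.0) / warmup_s
    return floor + (1.0 - floor) * progress

# Show how the weight grows during the assumed 90s warm-up window
for age in (0, 30, 60, 90):
    print(age, round(slow_start_weight(age), 2))
```

During the incident itself, pausing the rollout (for example with `kubectl rollout pause`) comes first. A ramp like this belongs in the follow-up, alongside options such as a smaller maxSurge or a readiness gate that waits for warm-up to finish.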
