On-call · Medium · oc-g279

Subject: Deploy incidents · Level: Mid–Senior · ~30 min · Common in: Reliability & on-call interviews · Industries: Technology, Software development

Question

A JVM service deploys with a rolling update. Each time a new pod comes up, there is a 20–30 second window in which requests routed to it see elevated 503s and very high latency; after that, the pod normalizes. Across the full rollout these windows add up to a noticeable error-rate bump and a few breached SLOs. The dashboards show that, for each new pod, the first ~25 s have high latency and 503s, and then the metrics flatten out. The readiness probe is a TCP check on the port, so the pod accepts connections (and therefore traffic) as soon as the process starts, before JIT warmup, connection-pool initialization, and first-request class loading have completed. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
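One way to make the mitigation concrete is to have readiness reflect warmup rather than an open port. The following is a minimal sketch, not a definitive implementation: `ReadinessGate`, its `Step` values, and the idea of exposing `statusCode()` through an HTTP readiness endpoint are all hypothetical names chosen for illustration, with the steps taken from the three startup costs named in the question. It assumes the readiness probe would be changed from a TCP check to an HTTP check against that endpoint.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/**
 * Hypothetical readiness gate: reports "not ready" until every warmup
 * step has been marked done. All names here are illustrative assumptions.
 */
final class ReadinessGate {

    /** Warmup steps, mirroring the startup costs listed in the question. */
    enum Step { JIT_WARMUP, CONNECTION_POOL, CLASS_LOADING }

    // Steps that have not completed yet; wrapped so that startup threads
    // and the probe handler can touch it concurrently.
    private final Set<Step> pending =
            Collections.synchronizedSet(EnumSet.allOf(Step.class));

    /** Called by startup code when one warmup step finishes. */
    void markDone(Step step) {
        pending.remove(step);
    }

    /** True only once every step has been marked done. */
    boolean isReady() {
        return pending.isEmpty();
    }

    /** Status an HTTP readiness endpoint could return: 200 if ready, else 503. */
    int statusCode() {
        return isReady() ? 200 : 503;
    }

    public static void main(String[] args) {
        ReadinessGate gate = new ReadinessGate();
        System.out.println(gate.statusCode()); // 503: nothing warmed up yet

        gate.markDone(Step.CONNECTION_POOL);
        gate.markDone(Step.JIT_WARMUP);
        gate.markDone(Step.CLASS_LOADING);
        System.out.println(gate.statusCode()); // 200: all steps complete
    }
}
```

With a gate like this, a new pod would stay out of the load-balancing rotation until warmup has finished, so the 20–30 second window would be spent warming up instead of serving user traffic. Whether synthetic warmup requests are also needed to trigger JIT compilation and class loading depends on the service, and that is worth stating as an open question in the answer.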

