On-call · Hard · oc-g559

Subject: On call · Level: Senior–Staff · ~40 min · Common in: Networking & APIs, Reliability & on-call interviews · Industries: Technology

Question

Your real-time presence service holds ~2 million persistent WebSocket connections across a fleet of gateway pods behind a load balancer. At 21:00 the LB has a 6-second blip (a control-plane hiccup) that drops a large fraction of connections at once. The blip is over by 21:00:06 — but the system does NOT recover. The gateway pods now CPU-saturate and crash-loop, the auth service they call on each new connection is at 100% and timing out, and the connection-count graph shows violent oscillation: connections surge, pods die, connections drop, pods come back, connections surge again. Even though the original LB blip is long gone, the fleet can't re-stabilize. Triage and explain why a 6-second blip caused a sustained outage.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
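
One common item under "what prevents a repeat" in a scenario like this is making clients spread their reconnects out instead of all retrying at once. The sketch below illustrates exponential backoff with full jitter. The scenario does not say how the presence clients currently reconnect, so everything here is an assumption for illustration: the names `reconnect_delay` and `peak_reconnects_per_second` and the `base` and `cap` values are hypothetical, not taken from the system described.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Full-jitter backoff: a delay drawn uniformly from
    [0, min(cap, base * 2**attempt)] seconds.

    `base` and `cap` are illustrative values, not tuned recommendations.
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    return rng.uniform(0.0, ceiling)


def peak_reconnects_per_second(clients, attempt, seed=0):
    """Toy estimate of the busiest one-second bucket when `clients`
    all lose their connection at the same instant and each waits
    reconnect_delay(attempt) before retrying."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    buckets = {}
    for _ in range(clients):
        second = int(reconnect_delay(attempt, rng=rng))
        buckets[second] = buckets.get(second, 0) + 1
    return max(buckets.values())


if __name__ == "__main__":
    # With a 1-second window every client lands in the same bucket;
    # with a 64-second window the same herd is spread much thinner.
    print(peak_reconnects_per_second(10_000, attempt=0))
    print(peak_reconnects_per_second(10_000, attempt=6))
```

Jitter only helps if clients actually honor it, so in an interview answer it is usually paired with a server-side limit on new connections per pod. That way the auth service is protected even from clients that retry immediately.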

Learn the concepts

Diagram & narrate the incident


Run or narrate your approach, then ask the coach.