Question
Your real-time presence service holds ~2 million persistent WebSocket connections across a fleet of gateway pods behind a load balancer (LB). At 21:00 the LB has a 6-second blip (a control-plane hiccup) that drops a large fraction of connections at once. The blip is over by 21:00:06, but the system does NOT recover. The gateway pods are now CPU-saturated and crash-looping, the auth service they call on each new connection is at 100% utilization and timing out, and the connection-count graph shows violent oscillation: connections surge, pods die, connections drop, pods come back, connections surge again. Even though the original LB blip is long gone, the fleet can't re-stabilize. Triage the incident and explain why a 6-second blip caused a sustained outage.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.