Code Room
On-call · Hard
Question
A 30-second network blip in one AZ drops ~200k persistent WebSocket connections to your realtime gateway. The moment connectivity returns, the gateway fleet's CPU and accept queue saturate, and the auth service behind it takes a synchronized request spike and starts timing out. The whole reconnect cycle keeps failing and retrying in lockstep: the dashboards show CPU oscillating in sharp, synchronized waves every few seconds. The clients use a fixed 1-second reconnect interval. How do you triage and break the cycle?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
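For the "what prevents a repeat" part, one common option is to replace the fixed 1-second reconnect interval with exponential backoff plus jitter, so that clients dropped at the same instant stop retrying at the same instant. The sketch below is illustrative only, not the reference answer: the function names, the 1-second base, the 60-second cap, and the 100 ms bucket width are all assumptions. It compares how many reconnect attempts land in the busiest 100 ms window under each policy.

```python
import random
from collections import Counter


def fixed_delay(attempt: int) -> float:
    """The policy from the question: always wait 1 second before retrying."""
    return 1.0


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                      rng: random.Random = random) -> float:
    """Exponential backoff with full jitter (hypothetical parameters).

    Picks a delay uniformly in [0, min(cap, base * 2**attempt)].
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def peak_attempts_per_bucket(delay_fn, clients: int, attempts: int,
                             buckets_per_second: int = 10) -> int:
    """Simulate clients that all disconnect at t=0 and fail `attempts` times.

    Returns the attempt count in the busiest time bucket
    (100 ms wide by default).
    """
    load = Counter()
    for _ in range(clients):
        t = 0.0
        for attempt in range(attempts):
            t += delay_fn(attempt)
            load[int(t * buckets_per_second)] += 1
    return max(load.values())


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs are comparable
    n = 2000  # scaled down from ~200k to keep the simulation quick
    fixed_peak = peak_attempts_per_bucket(fixed_delay, n, attempts=3)
    jitter_peak = peak_attempts_per_bucket(
        lambda a: full_jitter_delay(a, rng=rng), n, attempts=3)
    print("fixed interval peak per 100 ms:", fixed_peak)
    print("full jitter peak per 100 ms:   ", jitter_peak)
```

With the fixed interval, every client lands in the same bucket on every retry, which is the lockstep wave on the dashboards. Jitter only fixes the client side and only once clients ship it. During the incident itself, the mitigation still has to come from the server side, for example by rate-limiting or shedding new connections at the gateway so the auth service can recover.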