On-call | Hard | oc-g094

Subject: Thundering herd | Level: Senior–Staff | ~40 min | Common in Networking & APIs interviews | Industries: Technology

Question

A 30-second network blip in one AZ drops ~200k persistent WebSocket connections to your realtime gateway. The moment connectivity returns, the gateway fleet's CPU and accept-queue saturate, the auth service behind it gets a synchronized request spike and starts timing out, and the whole reconnect cycle keeps failing and retrying in lockstep — the dashboards show CPU oscillating in sharp synchronized waves every few seconds. The clients use a fixed 1-second reconnect interval. How do you triage and break the cycle?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
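As one illustration of "what prevents a repeat", the fixed 1-second reconnect interval from the question could be replaced with capped exponential backoff plus full jitter, so that clients dropped at the same instant do not all retry at the same instant. The sketch below is a minimal example of that general technique, not the expected answer or any particular client library's API; the function name `reconnect_delay` and the `base`, `cap`, and `max_exp` values are assumptions chosen for illustration.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, max_exp=16, rng=random):
    """Return a reconnect delay in seconds using capped exponential backoff
    with full jitter.

    attempt: number of failed reconnect attempts so far (0 for the first retry).
    base, cap, max_exp: hypothetical tuning values, not taken from the question.
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    exponent = min(attempt, max_exp)
    ceiling = min(cap, base * (2 ** exponent))
    # Full jitter: pick uniformly in [0, ceiling] so that clients which
    # disconnected together spread their retries across the whole window.
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt, rng=rng), 2))
```

A client would sleep for `reconnect_delay(attempt)` before each retry and reset `attempt` to 0 after a successful connection. Because this only helps once clients pick up the change, it is a prevention step; during the incident itself the mitigation has to happen server-side, for example by limiting how fast the gateway accepts new connections.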

