On-call · Hard · oc-g511

Subject: Autoscaling failure
Level: Senior–Staff
Duration: ~30 min
Common in: Reliability & on-call interviews
Industries: Technology

Question

Your service processes long-poll/streaming requests that routinely stay open 90-120 seconds. You set a generous 180-second connection-drain (graceful termination) period so scale-in wouldn't drop work. Yet after every evening scale-in you still see a burst of mid-stream disconnects and a small number of partial writes to your ledger that later need reconciliation. The drain window is clearly longer than a request, the pods get SIGTERM and the LB stops sending new traffic, and CPU is fine. Why is in-flight work still being dropped, and how do you fix it?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
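One hypothesis worth testing against real signals is that the 180-second drain is not the only timer in play. The sketch below is a simplified model, not a diagnosis: it assumes three independent timers that all start when scale-in begins, and treats the shortest one as the real budget for in-flight requests. The names `pod_grace_s` and `app_shutdown_s`, and the values 30 and 10, are hypothetical and do not come from the question; only the 180-second drain and the 120-second request length do.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DrainBudget:
    """Simplified model: every timer starts when termination begins."""

    lb_drain_s: float      # load balancer connection-drain period (180 in the question)
    pod_grace_s: float     # hypothetical: orchestrator grace period before a forced kill
    app_shutdown_s: float  # hypothetical: how long the server waits after SIGTERM before closing connections

    def limiting_timer(self) -> str:
        # Name of the shortest timer, i.e. the one that actually cuts requests off.
        timers = asdict(self)
        return min(timers, key=timers.get)

    def effective_s(self) -> float:
        # In-flight work only gets as long as the shortest timer allows.
        return min(asdict(self).values())

    def survives(self, longest_request_s: float) -> bool:
        # Worst case: a request that starts just before draining begins.
        return longest_request_s <= self.effective_s()


# The configuration as the question describes it: every layer honors 180 s.
as_described = DrainBudget(lb_drain_s=180, pod_grace_s=180, app_shutdown_s=180)

# A hypothetical mismatch: the load balancer waits 180 s, but the process gives up sooner.
suspect = DrainBudget(lb_drain_s=180, pod_grace_s=30, app_shutdown_s=10)

print(as_described.survives(120))
print(suspect.survives(120), suspect.limiting_timer())
```

If the real timers look like the second case, a 120-second stream cannot finish no matter how generous the load balancer's drain window is. The model deliberately ignores anything that shifts when a timer starts, such as a pre-stop delay, so treat it as a checklist of values to go and read from the live configuration rather than as a prediction.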

