Code Room
On-call · Hard
Question
Your service processes long-poll/streaming requests that routinely stay open 90-120 seconds. You set a generous 180-second connection-drain (graceful termination) period so scale-in wouldn't drop work. Yet after every evening scale-in you still see a burst of mid-stream disconnects and a small number of partial writes to your ledger that later need reconciliation. The drain window is clearly longer than a request, the pods get SIGTERM and the LB stops sending new traffic, and CPU is fine. Why is in-flight work still being dropped, and how do you fix it?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
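One hypothesis worth testing from real signals is that the 180-second drain is not the deadline that actually applies. The process is killed at the shortest timeout in the termination chain, and the application may also exit on SIGTERM without waiting for open streams. The sketch below is a minimal illustration under those assumptions, not the service's real code: `DrainGate` and `effective_drain_seconds` are hypothetical names, and the 30-second figure in the usage note assumes a Kubernetes pod left at the default `terminationGracePeriodSeconds`.

```python
import threading


class DrainGate:
    """Tracks in-flight requests and refuses new ones once draining starts."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_enter(self) -> bool:
        """Call at request start; returns False if the process is draining."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def leave(self) -> None:
        """Call when a request finishes, whether it succeeded or failed."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def begin_drain(self) -> None:
        """Call from the SIGTERM path: stop admitting new work."""
        with self._cond:
            self._draining = True

    def wait_drained(self, timeout: float) -> bool:
        """Block until no requests are in flight or the timeout expires.

        Returns True if everything finished, False if work was still open.
        """
        with self._cond:
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=timeout
            )


def effective_drain_seconds(*deadlines: float) -> float:
    """The real drain budget is the shortest deadline in the chain."""
    return min(deadlines)
```

For example, `effective_drain_seconds(180, 30)` returns 30: a 180-second load balancer drain paired with a 30-second grace period gives streams only 30 seconds before SIGKILL, which is shorter than a 90-120 second request. In a SIGTERM handler, the process would call `begin_drain()` and then `wait_drained(budget)` before exiting. A `False` result is worth logging, because it means the process is about to exit with work still open. Partial ledger writes are a separate problem that a longer drain alone does not solve; the usual fix there is making each write atomic or idempotent.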