Code Room
On-call · Hard
Question
Over ~6 hours after a deploy, your service slowly degrades: new client connections start getting refused, logs show `Too many open files` and `accept: too many open files`, and the process's open file-descriptor count (from `/proc/<pid>/fd`) has climbed steadily to its 65536 `ulimit -n` ceiling and plateaued. Request rate is flat and normal — this isn't a traffic event. Restarting an instance fixes it for a few hours, then it recurs. The deploy added a new outbound integration that makes HTTP calls. How do you triage and find the leak?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
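One way to turn "real signals" into a hypothesis is to break the open descriptors down by kind instead of watching only the total. The sketch below is a minimal illustration, not a prescribed tool: it assumes a Linux host, permission to read `/proc/<pid>/fd`, and Python 3.9 or later, and the function names (`classify_target`, `read_fd_targets`, `summarize`) are hypothetical.

```python
import os
from collections import Counter


def classify_target(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse kind."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    # Anything else is treated as a path on disk (regular file, device, ...).
    return "file"


def read_fd_targets(pid: int) -> list[str]:
    """Return the symlink target of every open fd of a process (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The fd was closed between listdir() and readlink(); skip it.
            continue
    return targets


def summarize(targets: list[str]) -> Counter:
    """Count open descriptors per kind."""
    return Counter(classify_target(t) for t in targets)
```

Calling `summarize(read_fd_targets(pid))` twice, a few minutes apart, shows which kind is growing. If nearly all of the growth is in `socket`, that is consistent with (though not proof of) the new outbound HTTP integration holding connections open, for example through unclosed responses or a client created per request; growth in `file` or `pipe` would point elsewhere.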