On-call · Hard · oc-g107

Subject: Quota exhaustion · Level: Senior–Staff · ~40 min · Common in: Reliability & on-call interviews · Industries: Technology

Question

Over ~6 hours after a deploy, your service slowly degrades: new client connections start getting refused, logs show `Too many open files` and `accept: too many open files`, and the process's open file-descriptor count (from `/proc/<pid>/fd`) has climbed steadily to its 65536 `ulimit -n` ceiling and plateaued. Request rate is flat and normal — this isn't a traffic event. Restarting an instance fixes it for a few hours, then it recurs. The deploy added a new outbound integration that makes HTTP calls. How do you triage and find the leak?
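A raw descriptor count says the process is leaking, but not what it is leaking. One way to narrow that down is to group the entries in `/proc/<pid>/fd` by kind. The following is a minimal sketch, assuming a Linux host where each entry is a symlink whose target looks like `socket:[12345]`, `pipe:[678]`, or a filesystem path. The helper name `classify_fds` is hypothetical and exists only for illustration.

```python
import os
import re
from collections import Counter


def classify_fds(fd_dir):
    """Count open descriptors by kind, based on the symlink targets in fd_dir.

    fd_dir is expected to be a directory such as /proc/<pid>/fd (Linux only).
    """
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listing and reading it.
            continue
        # Non-file targets start with a kind prefix such as "socket:" or
        # "pipe:"; regular files show up as absolute paths and do not match.
        match = re.match(r"(\w+):", target)
        kind = match.group(1) if match else "file"
        counts[kind] += 1
    return counts
```

Calling `classify_fds(f"/proc/{pid}/fd")` a few minutes apart and comparing the results shows which kind is growing. If `socket` dominates and keeps climbing while request rate is flat, that is consistent with the new outbound integration holding connections open; `lsof -p <pid>` or `ss -tanp` can then show which remote addresses those sockets point at.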

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
