On-call · Hard · oc-g189
Subject: Fd exhaustion
Level: Senior–Staff
Time: ~40 min
Common in: Reliability & on-call interviews
Industries: Software development

Question

A config-watching sidecar (Go) that hot-reloads files for a large fleet starts logging 'too many open files' and stops picking up config changes, though network requests still work fine. The process fd count is moderate (under 4k) and well below the 64k ulimit, yet adding a new watch fails. `cat /proc/sys/fs/inotify/max_user_watches` shows 8192. The last release added a recursive directory watcher, and the tree it watches recently grew to ~9000 files after a config refactor split one big file into many. How do you triage and fix?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
