Every open() hands back a file descriptor from a finite per-process table — forget to close() and the table fills until the process can open nothing more.
A process keeps a small, bounded table of file descriptors (integers like 3, 4, 5…). Every open file, socket, and pipe end consumes one slot, and the number of slots is capped by the soft limit that ulimit -n reports, often 1024 in a container.
If a hot path opens descriptors but does not close() them — a missing close in an error branch, no try/finally, a pooled connection never returned — the open count climbs and never comes back down. When the last slot is gone, the very next open() or accept() fails with EMFILE (“too many open files”), even though disk and memory are fine.
A leaky handler's open count climbs monotonically until the table is full and open() fails with EMFILE. A fixed handler opens and closes within each request, so the count hovers near the baseline indefinitely.
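The exhaustion described above is easy to reproduce in isolation. The following is a minimal sketch, assuming a Unix-like system where Python's `resource` module is available. The function name `leak_until_emfile` and the limit of 64 are illustrative choices, not part of the original text. It temporarily lowers the soft limit, leaks raw descriptors on purpose, and reports the errno of the first failed open.

```python
import errno
import os
import resource


def leak_until_emfile(limit=64):
    """Lower the soft fd limit, leak fds until open fails, return (errno, leaked)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Assumption: `limit` does not exceed the hard limit, otherwise setrlimit raises.
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    leaked = []
    try:
        while True:
            # Deliberately never closed inside the loop: this is the leak.
            leaked.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return exc.errno, len(leaked)
    finally:
        # Clean up so the demo does not damage the calling process.
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    code, leaked = leak_until_emfile()
    print(errno.errorcode[code], "after leaking", leaked, "descriptors")
```

The number of descriptors leaked before the failure depends on how many the process already had open, which is why the sketch reports it rather than assuming a value.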
A descriptor is just an index into the kernel’s per-process open-file table. open(), socket(), accept(), and pipe() all allocate the lowest free index and return it (pipe() allocates two, one per end); close() releases it. The table is bounded by the soft limit from ulimit -n (and by a system-wide cap). Open descriptors faster than you close them and the count rises with no ceiling but the limit.
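The lowest-free-index rule and the soft limit can both be observed directly. This is a small sketch, assuming a Unix-like system and a single-threaded process (another thread opening a file in between would change which index is free). The name `show_slot_reuse` is hypothetical.

```python
import os
import resource


def show_slot_reuse():
    """Open two descriptors, close the first, then open a third."""
    first = os.open(os.devnull, os.O_RDONLY)
    second = os.open(os.devnull, os.O_RDONLY)
    os.close(first)                            # the slot `first` is free again
    third = os.open(os.devnull, os.O_RDONLY)   # kernel hands out the lowest free index
    os.close(second)
    os.close(third)
    return first, second, third


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    first, second, third = show_slot_reuse()
    print(f"soft limit: {soft}, hard limit: {hard}")
    print(f"opened {first} and {second}; after closing {first}, next open returned {third}")
```

The third descriptor comes back with the same number as the first, which is also why a leaked fd is never reclaimed: only an explicit close() frees its slot.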
The bug almost always lives on an error path: the happy path closes, but an exception jumps over the close. The fix is to make the close unconditional — try/finally, a context manager, RAII, or defer.
# LEAK: close() is skipped whenever the read raises
def handle(path):
    f = open(path)       # grabs an fd
    data = f.read()      # if this throws, we jump past close()
    f.close()            # never runs on the error path -> fd leaked
    return data

# FIX: the context manager closes the fd on every exit, error or not
def handle(path):
    with open(path) as f:    # __exit__ always calls close()
        return f.read()      # exception still unwinds, but fd is released
The same shape applies to sockets (try { … } finally { sock.close() }), to Go (defer conn.Close()), and to C++ (a destructor closing the fd via RAII). The rule is identical: the release must not depend on reaching the end of the happy path.
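For sockets in Python, the try/finally form of the same rule might look like the sketch below. It uses `socket.socketpair()` so it runs without a network, and the `fail` flag is a hypothetical switch that simulates an error in the middle of a request. `next_free_fd` is an illustrative helper that reveals which index the kernel would hand out next.

```python
import os
import socket


def exchange(payload, fail=False):
    """Send payload across a socket pair, closing both ends on every path."""
    left, right = socket.socketpair()   # allocates two fds
    try:
        if fail:
            raise RuntimeError("simulated failure mid-request")
        left.sendall(payload)
        return right.recv(len(payload))
    finally:
        # Runs on normal return and on exception alike.
        left.close()
        right.close()


def next_free_fd():
    """Return the index the kernel would allocate next, without keeping it."""
    fd = os.open(os.devnull, os.O_RDONLY)
    os.close(fd)
    return fd
```

If `next_free_fd()` returns the same value before and after a failing call to `exchange`, no descriptor escaped on the error path.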
| Signal | What you see | How to detect it |
|---|---|---|
| Open-fd count climbs | Monotonic rise that never recovers, roughly tracking request volume | ls /proc/PID/fd \| wc -l over time, or lsof -p PID \| wc -l |
| EMFILE errors | open() / socket() return -1 with errno EMFILE; logs read “too many open files” | Grep logs for EMFILE / “too many open files” |
| accept() failing | New connections rejected; the listen socket still up but the server stops taking work | Rising accept errors; connection-refused at the edge |
| Health checks failing | Load balancer marks the instance unhealthy while CPU and memory look fine | Health endpoint times out though host metrics are flat |
| Distance to the limit | The ceiling that triggers EMFILE | ulimit -n; cat /proc/PID/limits for “Max open files” |
The tell is the shape: a leak shows fd count rising in lockstep with traffic while CPU, memory, and disk stay flat. That decoupling — busy table, quiet host — points straight at descriptors, not load.
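The first and last rows of the table can be combined into a single in-process measurement. The sketch below assumes Linux (it reads /proc) and uses hypothetical helper names `open_fd_count` and `fd_headroom`. Sampling it periodically and exporting the result as a metric is one way to see the shape described above.

```python
import os
import resource


def open_fd_count(pid="self"):
    """Count entries in /proc/<pid>/fd (Linux only).

    Listing the directory briefly opens a descriptor of its own, so for the
    current process the result may include that one transient entry.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_headroom():
    """Return (used, soft_limit, remaining) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    used = open_fd_count()
    return used, soft, soft - used


if __name__ == "__main__":
    used, soft, remaining = fd_headroom()
    print(f"{used} of {soft} descriptors in use, {remaining} remaining")
```

A remaining value that shrinks steadily across samples while CPU and memory stay flat matches the leak signature, whereas a value that falls and recovers with traffic is ordinary load.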
An open() followed by a close() later in the function leaks on every exception raised in between. Make the close unconditional so it runs on the error path too.
A pooled connection that is not returned on every exit path (early return, thrown exception) ties up its underlying socket fd until the pool is exhausted.
The listening socket holds one fd, and every accept() creates another. Forgetting to close the accepted connection leaks one fd per request even though the listener looks fine.
ulimit -n in containers. A 1024 default makes even a small leak fatal quickly. Raising the limit buys time but does not fix the leak; it only moves the cliff.
epoll, timerfd, eventfd, inotify, and pipes all consume fds too. A leaked timer or epoll instance exhausts the table just like a leaked file.
A JSON API runs comfortably under a 1024-fd limit for months. During a marketing push, traffic triples and within twenty minutes instances start returning 500s; the load balancer drains them one by one. CPU sits at 30% and memory is flat, but ls /proc/PID/fd | wc -l climbs by about one fd per request and never drops.
# Root cause: a metrics file opened per request, closed only on success
def record(event):
    f = open("/var/log/metrics.ndjson", "a")
    line = serialize(event)      # raises ValueError on malformed events
    f.write(line + "\n")
    f.close()                    # skipped whenever serialize() throws

# Under the spike, malformed events became common -> close() skipped ->
# one fd leaked per bad request -> table filled -> open()/accept() = EMFILE.

# Fix: close on every path.
def record(event):
    with open("/var/log/metrics.ndjson", "a") as f:
        f.write(serialize(event) + "\n")   # fd released even if this throws
After the with fix, the fd count flattened at its steady-state baseline and held there regardless of how many malformed events arrived. The instances stopped tripping health checks even at the higher load — the leak, not the load, had been the problem.
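A fix like this can be guarded with a small regression check. The sketch below is illustrative only: `json.dumps` stands in for the article's `serialize()`, the `path` parameter and the temporary file replace the real log location, and `leaks_under_bad_events` is a hypothetical helper. It feeds the fixed `record` a burst of unserializable events and checks that the next free descriptor index has not moved.

```python
import json
import os
import tempfile


def record(event, path):
    """Fixed version: the with block closes the file on every path."""
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")   # json.dumps stands in for serialize()


def next_free_fd():
    """Return the index the kernel would allocate next, without keeping it."""
    fd = os.open(os.devnull, os.O_RDONLY)
    os.close(fd)
    return fd


def leaks_under_bad_events(n=200):
    """Return True if n malformed events leave descriptors open."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "metrics.ndjson")
        before = next_free_fd()
        for _ in range(n):
            try:
                record({"bad": object()}, path)   # object() is not JSON-serializable
            except TypeError:
                pass                              # the caller survives the bad event
        after = next_free_fd()
    return after != before


if __name__ == "__main__":
    print("leak detected:", leaks_under_bad_events())
```

Comparing the next free index before and after is a cheap proxy for the fd count that works on any Unix-like system, provided no other thread opens or closes descriptors during the check.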
An instance returns EMFILE from accept(), yet CPU and memory dashboards are calm. What is most likely happening?
Your handler opens a file and closes it at the end, but it still leaks under load. Where is the fd most likely escaping?