File descriptor leaks

Every open() hands back a file descriptor from a finite per-process table — forget to close() and the table fills until the process can open nothing more.

The idea

A process keeps a bounded table of file descriptors (small integers like 3, 4, 5…). Every open file, socket, and pipe consumes one slot, and the number of slots is capped by the soft limit that ulimit -n reports, often 1024 in a container.

If a hot path opens descriptors but does not close() them — a missing close in an error branch, no try/finally, a pooled connection never returned — the open count climbs and never comes back down. When the last slot is gone, the very next open() or accept() fails with EMFILE (“too many open files”), even though disk and memory are fine.
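To see the failure without waiting for a real leak, a minimal sketch can lower the soft limit for the current process and open descriptors until the kernel refuses. The function name exhaust_fds and the limit of 64 are illustrative assumptions, and the resource module is available on Unix-like systems only.

```python
import errno
import os
import resource

def exhaust_fds(limit=64):
    """Lower the soft fd limit, leak descriptors until open fails,
    and return (errno, number_of_descriptors_opened)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    leaked = []
    try:
        while True:
            # Deliberately never closed inside the loop: this is the leak.
            leaked.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return exc.errno, len(leaked)
    finally:
        # Clean up so the rest of the process keeps working.
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

if __name__ == "__main__":
    err, opened = exhaust_fds()
    print(errno.errorcode[err], "after", opened, "opens")
```

The number of successful opens is a little below the limit because stdin, stdout, stderr, and anything else the process already holds occupy slots too.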


A fresh process starts with descriptors 0, 1, and 2 taken by stdin, stdout, and stderr; the rest of the table is free.

With the leak, the open count climbs monotonically until the table is full and the next open() fails with EMFILE. With the fix, each request opens and closes its own descriptors, so the count hovers near the baseline indefinitely.

How it works

A descriptor is just an index into the kernel’s per-process open-file table. open(), socket(), accept(), and pipe() all allocate the lowest free index and return it (pipe() takes two); close() releases it back. The table is bounded by the soft limit from ulimit -n (and a system-wide cap). Open descriptors faster than you close them and the count rises until it hits that limit.
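The lowest-free-index rule is easy to observe directly. The sketch below uses the raw os.open and os.close calls so the integers are visible; the function name lowest_free_demo is a hypothetical one, and the result assumes no other thread opens or closes descriptors at the same time.

```python
import os

def lowest_free_demo():
    """Open two descriptors, close the first, open a third.
    Returns (first, second, third) as plain integers."""
    first = os.open(os.devnull, os.O_RDONLY)
    second = os.open(os.devnull, os.O_RDONLY)
    os.close(first)                            # frees the lower slot
    third = os.open(os.devnull, os.O_RDONLY)   # kernel hands that slot back out
    os.close(second)
    os.close(third)
    return first, second, third

if __name__ == "__main__":
    print(lowest_free_demo())
```

Because the slot is reused immediately, a stale integer kept after close() can silently refer to a different file later, which is one more reason to tie the close to the scope that opened the descriptor.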

The bug almost always lives on an error path: the happy path closes, but an exception jumps over the close. The fix is to make the close unconditional — try/finally, a context manager, RAII, or defer.

# LEAK — close() is skipped whenever the read raises
def handle(path):
    f = open(path)            # grabs an fd
    data = f.read()           # if this raises, we jump past close()
    f.close()                 # never runs on the error path -> fd leaked
    return data

# FIX — the context manager closes the fd on every exit, error or not
def handle(path):
    with open(path) as f:     # __exit__ always calls close()
        return f.read()       # exception still unwinds, but fd is released

The same shape applies to sockets (try { … } finally { sock.close() }), to Go (defer conn.Close()), and to C++ (a destructor closing the fd via RAII). The rule is identical: the release must not depend on reaching the end of the happy path.
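For sockets, the same shape in Python might look like the sketch below. The handler name handle_conn and its toy protocol (uppercase the request, reject blank input) are assumptions made for illustration; only the try/finally structure is the point.

```python
import socket

def handle_conn(conn):
    """Serve one request on an accepted socket and close it on every exit path."""
    try:
        data = conn.recv(1024)
        if not data.strip():
            raise ValueError("empty request")   # error path
        conn.sendall(data.upper())              # happy path
    finally:
        conn.close()                            # runs on both paths

if __name__ == "__main__":
    client, server_side = socket.socketpair()
    client.sendall(b"ping")
    handle_conn(server_side)
    print(client.recv(1024))
    client.close()
```

Writing "with conn:" instead of try/finally gives the same guarantee, since socket objects are context managers that close on exit.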

Signals

Signal	What you see	How to detect it
Open-fd count climbs	Monotonic rise that never recovers, roughly tracking request volume	`ls /proc/PID/fd \| wc -l` over time, or `lsof -p PID \| wc -l`
EMFILE errors	`open()` / `socket()` return `-1` with errno `EMFILE`; logs read “too many open files”	Grep logs for `EMFILE` / “too many open files”
`accept()` failing	New connections are not picked up; the listen socket is still up but the server stops taking work	Rising `accept` errors; connection timeouts or resets at the edge
Health checks failing	Load balancer marks the instance unhealthy while CPU and memory look fine	Health endpoint times out though host metrics are flat
Distance to the limit	The ceiling that triggers EMFILE	`ulimit -n`; `cat /proc/PID/limits` for “Max open files”

The tell is the shape: a leak shows fd count rising in lockstep with traffic while CPU, memory, and disk stay flat. That decoupling — busy table, quiet host — points straight at descriptors, not load.
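The first and last rows of the table can also be sampled from inside the process, which is handy for exporting a gauge. This is a sketch under assumptions: the names fd_usage and headroom_warning and the 0.8 threshold are made up here, and the directory listing relies on /proc/self/fd (Linux) or /dev/fd.

```python
import os
import resource

def fd_usage():
    """Return (open_fds, soft_limit) for the current process."""
    # /proc/self/fd exists on Linux; /dev/fd is the closest equivalent elsewhere.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    # Listing the directory briefly opens it, so the count may run one high.
    open_fds = len(os.listdir(fd_dir))
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit

def headroom_warning(threshold=0.8):
    """Return a message once usage reaches threshold * soft limit, else None."""
    open_fds, soft_limit = fd_usage()
    if soft_limit != resource.RLIM_INFINITY and open_fds >= threshold * soft_limit:
        return f"fd usage {open_fds}/{soft_limit}"
    return None

if __name__ == "__main__":
    print(fd_usage(), headroom_warning())
```

Sampling this on a timer and plotting it next to request rate makes the "busy table, quiet host" shape visible well before EMFILE appears.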

Watch out for

No try/finally, with, or defer. A plain open() followed by a close later in the function leaks on every exception in between. Make the close unconditional so it runs on the error path too.
Leaking only on the error path. Tests pass because the happy path closes cleanly; the leak hides in the rare exception branch and only surfaces under real failures or a traffic spike.
Pooled connections never returned. A database or HTTP connection borrowed from a pool but not returned (early return, thrown exception) ties up its underlying socket fd until the pool is exhausted.
Forgetting to close accepted sockets. A listening socket is one fd, but every accept() returns a new one. Failing to close the accepted connection leaks one fd per connection even though the listener looks fine.
Too-low ulimit -n in containers. A 1024 default makes any small leak fatal quickly. Raising the limit buys time but does not fix the leak — it only moves the cliff.
Forgetting non-file descriptors. epoll, timerfd, eventfd, inotify, and pipes all consume fds too. A leaked timer or epoll instance exhausts the table just like a leaked file.
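For the pooled-connection case in the list above, one way to make the return unconditional is to expose borrowing only through a context manager. The Pool class below is a hypothetical toy, not the API of any real driver; plain strings stand in for connections.

```python
import contextlib
import queue

class Pool:
    """Toy pool: connections can only be borrowed inside a with block."""

    def __init__(self, conns):
        self._free = queue.Queue()
        for conn in conns:
            self._free.put(conn)

    @contextlib.contextmanager
    def borrow(self):
        conn = self._free.get_nowait()   # raises queue.Empty when exhausted
        try:
            yield conn
        finally:
            self._free.put(conn)         # returned on success, error, or early return

    def available(self):
        return self._free.qsize()

if __name__ == "__main__":
    pool = Pool(["conn-a"])
    try:
        with pool.borrow() as conn:
            raise RuntimeError("query failed")
    except RuntimeError:
        pass
    print(pool.available())   # the connection is back in the pool
```

A real pool would usually also check whether the connection is still healthy before putting it back; that step is left out to keep the sketch short.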

Worked example

A JSON API runs comfortably under a 1024-fd limit for months. During a marketing push, traffic triples and within twenty minutes instances start returning 500s; the load balancer drains them one by one. CPU sits at 30% and memory is flat, but ls /proc/PID/fd | wc -l climbs steadily with request volume and never drops.

# Root cause: a metrics file opened per request, closed only on success
def record(event):
    f = open("/var/log/metrics.ndjson", "a")
    line = serialize(event)        # raises ValueError on malformed events
    f.write(line + "\n")
    f.close()                      # skipped whenever serialize() throws

# Under the spike, malformed events became common -> close() skipped ->
# one fd leaked per bad request -> table filled -> open()/accept() = EMFILE.

# Fix: close on every path.
def record(event):
    with open("/var/log/metrics.ndjson", "a") as f:
        f.write(serialize(event) + "\n")   # fd released even if this throws

After the with fix, the fd count flattened at its steady-state baseline and held there regardless of how many malformed events arrived. The instances stopped tripping health checks even at the higher load — the leak, not the load, had been the problem.

Check yourself

An instance returns EMFILE from accept(), yet CPU and memory dashboards are calm. What is most likely happening?

Your handler opens a file and closes it at the end, but it still leaks under load. Where is the fd most likely escaping?