File descriptor leaks

Every open() hands back a file descriptor from a finite per-process table — forget to close() and the table fills until the process can open nothing more.

The idea

A process keeps a bounded table of file descriptors (small integers like 3, 4, 5…). Every open file, socket, and pipe end consumes one slot, and the number of slots is capped by the soft limit that ulimit -n reports, often 1024 in a container.

If a hot path opens descriptors but does not close() them — a missing close in an error branch, no try/finally, a pooled connection never returned — the open count climbs and never comes back down. When the last slot is gone, the very next open() or accept() fails with EMFILE (“too many open files”), even though disk and memory are fine.
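The failure is easy to reproduce on a small scale. The sketch below assumes a Unix-like system where Python's resource module is available. The helper name open_until_emfile and the limit of 64 are illustrative choices, not part of the original text. It lowers the soft limit, opens /dev/null without closing until the open fails, and reports how many descriptors it managed to take.

```python
import errno
import os
import resource


def open_until_emfile(soft_limit=64):
    """Leak descriptors on purpose until the kernel refuses with EMFILE.

    Returns the number of descriptors opened before the failure.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    leaked = []
    try:
        # Lower only the soft limit; the hard limit is left untouched.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
        while True:
            try:
                leaked.append(os.open(os.devnull, os.O_RDONLY))
            except OSError as exc:
                if exc.errno != errno.EMFILE:
                    raise
                return len(leaked)
    finally:
        # Release everything we leaked and restore the original limit.
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
```

The returned count should land a little under the limit, because stdin, stdout, stderr, and anything else the process already holds occupy slots too.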

A fresh process starts with descriptors 0, 1, and 2 bound to stdin, stdout, and stderr; the rest of the table is free.

A leaky handler climbs monotonically to a full table and EMFILE. A fixed handler opens and closes within each request, so the count hovers near the baseline indefinitely.

How it works

A descriptor is just an index into the kernel’s per-process open-file table. open(), socket(), accept(), and pipe() allocate the lowest free index (pipe() takes two, one per end) and return it; close() releases it back. The table is bounded by the soft limit from ulimit -n (and a system-wide cap). Open descriptors faster than you close them and the count rises with no ceiling but the limit.
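The lowest-free-index rule can be observed directly. This is a minimal sketch, assuming a single-threaded process on a Unix-like system (another thread opening files at the same moment could take the freed slot first). The variable names are illustrative.

```python
import os
import resource

# Soft and hard limits on open descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

first = os.open(os.devnull, os.O_RDONLY)
second = os.open(os.devnull, os.O_RDONLY)

os.close(first)                             # frees the lower of the two slots
reused = os.open(os.devnull, os.O_RDONLY)   # kernel hands out the lowest free index

# True when the freed slot was handed straight back.
slot_was_reused = (reused == first)

os.close(second)
os.close(reused)
```

Because slots are recycled this way, a healthy process keeps returning to the same handful of small numbers, while a leaking one shows ever-larger descriptor values.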

The bug almost always lives on an error path: the happy path closes, but an exception jumps over the close. The fix is to make the close unconditional — try/finally, a context manager, RAII, or defer.

# LEAK — close() is skipped whenever the read raises
def handle(path):
    f = open(path)            # grabs an fd
    data = f.read()           # if this raises, we jump past close()
    f.close()                 # never runs on the error path -> fd leaked
    return data

# FIX — the context manager closes the fd on every exit, error or not
def handle(path):
    with open(path) as f:     # __exit__ always calls close()
        return f.read()       # exception still unwinds, but fd is released

The same shape applies to sockets (try { … } finally { sock.close() }), to Go (defer conn.Close()), and to C++ (a destructor closing the fd via RAII). The rule is identical: the release must not depend on reaching the end of the happy path.
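For sockets in Python, the same rule might look like the sketch below. The helper with_socket_pair and the echo callback are hypothetical names introduced for illustration; socket.socketpair() stands in for a real connection so the example runs without a network.

```python
import socket


def with_socket_pair(work):
    """Run work(left, right) and close both sockets on every exit path."""
    left, right = socket.socketpair()
    try:
        return work(left, right)
    finally:
        # Runs on a normal return and when work() raises.
        left.close()
        right.close()


def echo(left, right):
    """Send a short message through the pair and read it back."""
    left.sendall(b"ping")
    return right.recv(4)


reply = with_socket_pair(echo)
```

A closed socket reports fileno() == -1, which gives a cheap way to assert in tests that the release really happened on the error path.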

Signals

Signal: Open-fd count climbs
  What you see: Monotonic rise that never recovers, roughly tracking request volume
  How to detect it: ls /proc/PID/fd | wc -l over time, or lsof -p PID | wc -l

Signal: EMFILE errors
  What you see: open() / socket() return -1 with errno EMFILE; logs read “too many open files”
  How to detect it: Grep logs for EMFILE / “too many open files”

Signal: accept() failing
  What you see: New connections rejected; the listen socket still up but the server stops taking work
  How to detect it: Rising accept errors; connection-refused at the edge

Signal: Health checks failing
  What you see: Load balancer marks the instance unhealthy while CPU and memory look fine
  How to detect it: Health endpoint times out though host metrics are flat

Signal: Distance to the limit
  What you see: The ceiling that triggers EMFILE
  How to detect it: ulimit -n; cat /proc/PID/limits for “Max open files”

The tell is the shape: a leak shows fd count rising in lockstep with traffic while CPU, memory, and disk stay flat. That decoupling — busy table, quiet host — points straight at descriptors, not load.
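The fd count can also be sampled from inside the process and exported as a metric. The sketch below is Linux-only because it reads /proc; the function names open_fd_count and fd_headroom are assumptions made for this example.

```python
import os
import resource


def open_fd_count(pid="self"):
    """Count entries in /proc/<pid>/fd (Linux only).

    When inspecting the current process, the directory listing itself
    briefly holds one descriptor, so the result can read one high.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_headroom():
    """Return (open_count, soft_limit) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fd_count(), soft
```

Plotting the first number against the second over time makes the shape visible: a leak shows as a line climbing toward the limit, while a healthy process stays flat.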

Worked example

A JSON API runs comfortably under a 1024-fd limit for months. During a marketing push, traffic triples and within twenty minutes instances start returning 500s; the load balancer drains them one by one. CPU sits at 30% and memory is flat, but ls /proc/PID/fd | wc -l climbs steadily with request volume and never drops.

# Root cause: a metrics file opened per request, closed only on success
def record(event):
    f = open("/var/log/metrics.ndjson", "a")
    line = serialize(event)        # raises ValueError on malformed events
    f.write(line + "\n")
    f.close()                      # skipped whenever serialize() raises

# Under the spike, malformed events became common -> close() skipped ->
# one fd leaked per bad request -> table filled -> open()/accept() = EMFILE.

# Fix: close on every path.
def record(event):
    with open("/var/log/metrics.ndjson", "a") as f:
        f.write(serialize(event) + "\n")   # fd released even if this raises

After switching to the with block, the fd count flattened at its steady-state baseline and held there regardless of how many malformed events arrived. The instances stopped tripping health checks even at the higher load: the leak, not the load, had been the problem.
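A regression check for this kind of fix could look like the sketch below. It is Linux-only (it reads /proc/self/fd), and several pieces are assumptions: serialize() is a hypothetical stand-in that rejects events without a "name" key, record() takes the log path as a parameter so the test can write to a temporary directory, and fds_leaked_by is an illustrative helper name.

```python
import os
import tempfile


def serialize(event):
    """Hypothetical stand-in: rejects events that lack a 'name' key."""
    if "name" not in event:
        raise ValueError("malformed event")
    return event["name"]


def record(event, path):
    """Fixed version: the with block closes the file on every exit path."""
    with open(path, "a") as f:
        f.write(serialize(event) + "\n")


def fds_leaked_by(bad_events=200):
    """Feed malformed events to record() and return the change in fd count."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "metrics.ndjson")
        before = len(os.listdir("/proc/self/fd"))
        for _ in range(bad_events):
            try:
                record({}, path)      # malformed: serialize() raises
            except ValueError:
                pass
        return len(os.listdir("/proc/self/fd")) - before
```

With the fixed record(), the difference should be zero no matter how many malformed events are fed in, which makes it a reasonable assertion for a test suite.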

Check yourself

An instance returns EMFILE from accept(), yet CPU and memory dashboards are calm. What is most likely happening?

Your handler opens a file and closes it at the end, but it still leaks under load. Where is the fd most likely escaping?