A server thread that blocks forever on a stalled connection is a thread you'll never get back.
A classic thread-per-connection server calls accept() to take a client socket, then read() to pull the request off the wire. If the socket has no read or accept timeout, a client that connects but never sends — a half-open or slowloris connection — leaves that worker thread parked in read() indefinitely.
With a fixed thread pool, it takes only as many stalled connections as there are workers to drain the pool, after which healthy clients are queued or refused. It's a denial of service with no crash and no log line. The fix is a socket read deadline, so a stalled connection is reaped and its thread returns to the pool.
The bug is the absence of a deadline. A blocking recv() on a socket with no timeout has no upper bound — it waits as long as the peer keeps the connection open but silent.
# BUG — the worker blocks here forever if the client never sends
conn, addr = srv.accept() # accept() can also block with no timeout
data = conn.recv(1024) # parks the thread indefinitely on a stalled peer
handle(data)
conn.close()
The fix gives the socket a read deadline. When the peer stays silent past the deadline, recv() raises, you close the connection, and the worker thread is freed back to the pool.
# FIX — bound the wait, reap the stall, return the thread
conn, addr = srv.accept()
conn.settimeout(5.0) # or setsockopt SO_RCVTIMEO at the OS level
try:
    data = conn.recv(1024)
    handle(data)
except socket.timeout:
    pass  # stalled peer: reap it
finally:
    conn.close()  # always release the socket and the thread
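To watch the deadline fire without a real network, the sketch below wraps the same settimeout / recv / close pattern in a helper and exercises it with socket.socketpair(). The helper name read_with_deadline and the 0.2-second timeout are illustrative assumptions, not part of the original code.

```python
import socket
from typing import Optional


def read_with_deadline(conn: socket.socket, timeout_s: float) -> Optional[bytes]:
    """Return the first chunk from conn, or None if the peer stays silent.

    Hypothetical helper: it always closes conn, mirroring the fix above.
    """
    conn.settimeout(timeout_s)
    try:
        return conn.recv(1024)
    except socket.timeout:
        return None  # stalled peer: give up instead of waiting forever
    finally:
        conn.close()


# A connected socket pair stands in for the server and client ends.
server_side, silent_client = socket.socketpair()
result_silent = read_with_deadline(server_side, 0.2)  # client never sends

server_side2, talking_client = socket.socketpair()
talking_client.sendall(b"GET / HTTP/1.0\r\n\r\n")
result_ok = read_with_deadline(server_side2, 0.2)  # data is already waiting

silent_client.close()
talking_client.close()
print(result_silent, result_ok)
```

The silent peer costs the caller at most the timeout, while the peer that sends promptly is unaffected by it.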
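At the OS level this is setsockopt(SO_RCVTIMEO, …). Better still, non-blocking or async I/O (selectors, epoll, kqueue) drops the one-thread-per-connection model entirely, so a single thread can multiplex thousands of sockets and a stalled peer ties up a file descriptor and a little memory rather than a whole thread. Idle sockets still need a deadline, or they accumulate until the descriptor limit is reached.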
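As a rough illustration of the multiplexed model, the sketch below runs a single-threaded echo loop on Python's selectors module and closes any peer that has been silent too long. The function name run_event_loop, the 0.3-second IDLE_LIMIT_S, and the echo behaviour are assumptions chosen to keep the demo short; a production loop would also buffer writes rather than call sendall() on a non-blocking socket.

```python
import selectors
import socket
import threading
import time

IDLE_LIMIT_S = 0.3  # assumed idle deadline, short so the demo finishes quickly


def run_event_loop(listener: socket.socket, run_for_s: float) -> int:
    """Echo on one thread for run_for_s seconds; return how many idle peers were reaped."""
    sel = selectors.DefaultSelector()
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ)
    last_seen = {}  # client socket -> monotonic time of last activity
    reaped = 0

    def drop(sock):
        sel.unregister(sock)
        sock.close()
        del last_seen[sock]

    stop_at = time.monotonic() + run_for_s
    while time.monotonic() < stop_at:
        for key, _events in sel.select(timeout=0.05):
            sock = key.fileobj
            if sock is listener:
                conn, _addr = listener.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
                last_seen[conn] = time.monotonic()
                continue
            try:
                data = sock.recv(1024)
            except BlockingIOError:
                continue  # nothing to read after all
            except ConnectionError:
                data = b""  # treat a reset like a close
            if data:
                sock.sendall(data)  # tiny payloads only in this sketch
                last_seen[sock] = time.monotonic()
            else:
                drop(sock)  # peer closed the connection
        # Reap peers that have been silent longer than the idle deadline.
        now = time.monotonic()
        for sock in [s for s, seen in last_seen.items() if now - seen > IDLE_LIMIT_S]:
            drop(sock)
            reaped += 1
    for sock in list(last_seen):
        drop(sock)
    sel.close()
    return reaped


listener = socket.create_server(("127.0.0.1", 0))
port = listener.getsockname()[1]
result = {}
loop = threading.Thread(
    target=lambda: result.update(reaped=run_event_loop(listener, 1.0))
)
loop.start()

stalled = socket.create_connection(("127.0.0.1", port))  # connects, never sends
healthy = socket.create_connection(("127.0.0.1", port), timeout=2.0)
healthy.sendall(b"ping")
echo = healthy.recv(1024)  # served even though a stalled peer is connected
healthy.close()

loop.join()
stalled.close()
listener.close()
print(echo, result["reaped"])
```

Here the stalled peer never occupies a thread; it only sits in the selector until the idle sweep closes it.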
| Approach | Trade-off | Signal to watch |
|---|---|---|
| No timeout | Threads leak on every stalled connection | Pool usage climbing while CPU sits idle |
| Read timeout | Reaps stalls, but may cut slow-but-legitimate clients early | Rising count of timeout-triggered closes |
| Thread-per-connection | Simple to reason about, but bounded by pool size | Free workers trending toward zero |
| Non-blocking I/O | Scales to many idle sockets, but more complex code | Event-loop lag and ready-set size |
| Idle / connection caps | Defense in depth; rejects abusers, adds tuning | Per-IP connection counts and reject rate |
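The connection-cap row can be sketched as a small thread-safe counter keyed by client IP. The class name ConnectionCap, its methods, and the limit of 2 are hypothetical, and the address comes from the documentation range 203.0.113.0/24; a real server would call try_acquire() right after accept() and release() when the connection closes.

```python
import threading
from collections import Counter


class ConnectionCap:
    """Hypothetical per-IP limiter: refuse connections above max_per_ip."""

    def __init__(self, max_per_ip: int) -> None:
        self._max = max_per_ip
        self._open = Counter()  # ip -> currently open connections
        self._lock = threading.Lock()
        self.rejected = 0  # running reject count, useful as a metric

    def try_acquire(self, ip: str) -> bool:
        with self._lock:
            if self._open[ip] >= self._max:
                self.rejected += 1
                return False
            self._open[ip] += 1
            return True

    def release(self, ip: str) -> None:
        with self._lock:
            if self._open[ip] > 0:
                self._open[ip] -= 1
            if self._open[ip] == 0:
                del self._open[ip]  # keep the table from growing without bound


cap = ConnectionCap(max_per_ip=2)
attempts = [cap.try_acquire("203.0.113.7") for _ in range(3)]
other_ip = cap.try_acquire("203.0.113.8")  # a different client is unaffected
cap.release("203.0.113.7")
after_release = cap.try_acquire("203.0.113.7")  # a slot is free again
print(attempts, other_ip, after_release, cap.rejected)
```

A cap like this complements the read deadline rather than replacing it: the deadline frees threads, while the cap limits how many any one address can hold at a time.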
Setting a deadline on the read but not the connect / accept path, or forgetting to actually close and reap the socket on timeout, leaves the leak in place.
Picture a server with a 50-thread pool fronting real users. An attacker opens 50 slowloris connections that complete the TCP handshake and then send nothing. Each recv() blocks with no deadline, so one by one all 50 workers park forever. Free threads hit zero; the next legitimate user is queued or refused. CPU is near idle, nothing crashes, and no log fires: the same drain-to-zero you can step through in the animation above.
Now add conn.settimeout(5.0). Five seconds after each stalled recv() starts, it raises socket.timeout, the connection is closed, and that worker returns to the pool. The 50 attacker sockets churn harmlessly in and out, free threads stay healthy, and real users keep getting served — phase B of the visual.
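The scenario can be reproduced at small scale. The sketch below is a scaled-down simulation under stated assumptions: 4 workers instead of 50, a 0.2-second read deadline instead of 5 seconds, and a hypothetical helper legit_client_served() that fills every worker with a silent peer and then checks whether a real client gets a reply within one second. The accept() call is given its own short timeout so the accept loop can be stopped cleanly.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


def handle(conn, read_timeout_s):
    """Worker body: read one request, answer it, always release the socket."""
    conn.settimeout(read_timeout_s)  # None reproduces the bug: no deadline
    try:
        if conn.recv(1024):
            conn.sendall(b"ok")
    except socket.timeout:
        pass  # stalled peer: reap it
    except ConnectionError:
        pass  # peer vanished mid-request
    finally:
        conn.close()


def legit_client_served(read_timeout_s, workers=4, wait_s=1.0):
    """Fill every worker with a silent peer, then report whether a real client gets a reply."""
    srv = socket.create_server(("127.0.0.1", 0))
    srv.settimeout(0.1)  # bound accept() too, so the loop can notice `stop`
    port = srv.getsockname()[1]
    pool = ThreadPoolExecutor(max_workers=workers)
    stop = threading.Event()

    def accept_loop():
        while not stop.is_set():
            try:
                conn, _addr = srv.accept()
            except socket.timeout:
                continue
            pool.submit(handle, conn, read_timeout_s)

    acceptor = threading.Thread(target=accept_loop)
    acceptor.start()

    # Attackers connect first and send nothing, occupying one worker each.
    attackers = [socket.create_connection(("127.0.0.1", port)) for _ in range(workers)]
    client = socket.create_connection(("127.0.0.1", port), timeout=wait_s)
    client.sendall(b"hello")
    try:
        served = client.recv(16) == b"ok"
    except socket.timeout:
        served = False

    # Cleanup: closing the silent peers unblocks any worker still parked in recv().
    for sock in attackers:
        sock.close()
    client.close()
    stop.set()
    acceptor.join()
    srv.close()
    pool.shutdown(wait=True)
    return served


served_without_deadline = legit_client_served(read_timeout_s=None)
served_with_deadline = legit_client_served(read_timeout_s=0.2)
print(served_without_deadline, served_with_deadline)
```

Without a deadline the real client's request sits in the pool's queue until its own timeout expires; with one, the silent peers are reaped and the queued request is served shortly after.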
1. Your thread-per-connection server stops responding to new clients. CPU is near idle, nothing has crashed, and there are no errors in the logs. What's the most likely cause?
2. You add a read deadline to reap stalled sockets. Which detail matters most so the thread is actually recovered?