Worker thread pool exhaustion

When every worker is stuck waiting on something slow, the queue keeps growing and new work just waits — even though the CPU is barely doing anything.

The idea

A service usually serves requests from a fixed pool of worker threads. Each worker takes a task off a queue, runs it, then comes back for the next one. That works beautifully — until each task has to block on something slow.

If a downstream dependency slows down, every worker that calls it gets stuck holding its task, waiting. Once all the workers are blocked, the queue stops draining and starts backing up, and new requests wait behind a wall of stuck workers. The dangerous part: the CPU may be nearly idle the whole time. The workers aren't busy computing — they're blocked, and there's no thread left to start the next thing.
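A minimal sketch of that effect, assuming `time.sleep` is a fair stand-in for a blocking call to a slow dependency. The names `blocking_task` and `run_batch` are made up for illustration. It compares wall-clock time with CPU time while a small pool drains tasks that only wait.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_task(latency_s):
    """Stand-in for a downstream call: the thread waits, it does not compute."""
    time.sleep(latency_s)

def run_batch(workers, tasks, latency_s):
    """Return (wall_seconds, cpu_seconds) spent draining `tasks` blocking tasks."""
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(tasks):
            pool.submit(blocking_task, latency_s)
    # Leaving the `with` block waits for every submitted task to finish.
    return time.perf_counter() - wall_start, time.process_time() - cpu_start

# 16 tasks on 4 workers at 50ms each: at least 4 rounds, so about 0.2s of wall time.
wall, cpu = run_batch(workers=4, tasks=16, latency_s=0.05)
print(f"wall={wall:.2f}s cpu={cpu:.3f}s")
```

The wall time is set by how long the workers wait, while the CPU time stays a small fraction of it. That gap is what the rest of this page is about.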

See it work

[Interactive demo: a worker pool with a downstream-latency slider, starting at fast (50ms). In the healthy state, downstream is fast, so workers finish quickly and the queue stays short; slowing the dependency makes the workers block.]

How it works

The fix isn't more threads — it's bounding the system so it fails fast and stays predictable instead of melting down silently.

// A bounded pool that degrades gracefully instead of exhausting.

pool    = WorkerPool(size = N)          // fixed worker count
queue   = BoundedQueue(capacity = Q)    // NOT unbounded
breaker = CircuitBreaker(downstream)    // trips on too many slow/failed calls

function submit(task):
    if not queue.offer(task):           // queue is full -> shed load now
        reject(task, "overloaded")      // fail fast, count it, return 503
        return

function worker_loop():
    while true:
        task = queue.take()
        if breaker.is_open():           // downstream known-bad: skip the call
            fail_fast(task)             // don't park a worker waiting on it
            continue
        try:
            // per-task timeout: a slow dependency can't hold a worker forever
            result = call_downstream(task, timeout = 800ms)
            breaker.record_success()
            complete(task, result)
        catch Timeout, DownstreamError: // slow or failed: both count against the breaker
            breaker.record_failure()
            fail_fast(task)             // worker is free for the next task

// Contrast:
//   unbounded queue + no timeout      -> problem HIDDEN until OOM / total stall
//   bounded queue + timeout + breaker -> problem VISIBLE as rejections,
//                                        workers stay free, core stays alive
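The pseudocode can be made concrete in a few dozen lines. What follows is a sketch under stated assumptions, not a production pool: `CircuitBreaker`, `BoundedPool`, and `call_downstream` are hypothetical names, the breaker policy (three consecutive failures, fixed cooldown) is one simple choice among many, and the dependency is simulated with `time.sleep`. Python cannot interrupt a thread from outside, so the timeout has to be enforced by the call itself, as a real client's socket or request timeout would.

```python
import queue
import threading
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, closes again after `cooldown_s`."""
    def __init__(self, threshold=3, cooldown_s=5.0):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self._failures, self._opened_at = 0, None
        self._lock = threading.Lock()

    def is_open(self):
        with self._lock:
            if self._opened_at is None:
                return False
            if time.monotonic() - self._opened_at >= self.cooldown_s:
                self._failures, self._opened_at = 0, None  # cooldown over: try again
                return False
            return True

    def record_success(self):
        with self._lock:
            self._failures = 0

    def record_failure(self):
        with self._lock:
            self._failures += 1
            if self._failures >= self.threshold:
                self._opened_at = time.monotonic()

def call_downstream(task, latency_s, timeout_s):
    """Simulated dependency that gives up after timeout_s instead of waiting it out."""
    if latency_s > timeout_s:
        time.sleep(timeout_s)
        raise TimeoutError(f"task {task} timed out after {timeout_s}s")
    time.sleep(latency_s)
    return f"ok:{task}"

class BoundedPool:
    def __init__(self, workers, capacity, latency_s, timeout_s):
        self.tasks = queue.Queue(maxsize=capacity)  # bounded, NOT unbounded
        self.breaker = CircuitBreaker()
        self.latency_s, self.timeout_s = latency_s, timeout_s
        self.outcomes = {"ok": 0, "rejected": 0, "timed_out": 0, "short_circuited": 0}
        self._lock = threading.Lock()
        for _ in range(workers):
            threading.Thread(target=self._worker_loop, daemon=True).start()

    def _count(self, outcome):
        with self._lock:
            self.outcomes[outcome] += 1

    def submit(self, task):
        try:
            self.tasks.put_nowait(task)  # queue full -> shed load now
            return True
        except queue.Full:
            self._count("rejected")
            return False

    def _worker_loop(self):
        while True:
            task = self.tasks.get()
            try:
                if self.breaker.is_open():  # downstream known-bad: skip the call
                    self._count("short_circuited")
                    continue
                call_downstream(task, self.latency_s, self.timeout_s)
                self.breaker.record_success()
                self._count("ok")
            except TimeoutError:
                self.breaker.record_failure()
                self._count("timed_out")
            finally:
                self.tasks.task_done()

# A dependency at 200ms against a 50ms timeout: every real call times out.
pool = BoundedPool(workers=2, capacity=8, latency_s=0.2, timeout_s=0.05)
accepted = sum(pool.submit(i) for i in range(20))
pool.tasks.join()  # returns quickly: nothing waits longer than the timeout
print(accepted, pool.outcomes)
```

The exact counts depend on thread timing, but the shape is stable: most of the burst is rejected at the door, a few calls time out, and the rest are short-circuited once the breaker opens.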

Signals

Symptom	What it's telling you
Pool utilization at 100%, all workers busy	No free worker to pick up new work — you're saturated, not just loaded.
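Queue depth climbing and not draining	Tasks arrive faster than workers finish them, so the backlog never settles. Little's Law (in-flight = arrival rate × time in system) explains the climb: as latency rises, in-flight work rises with it.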
Latency p99 spiking while CPU stays low	Workers are blocked, not busy. The bottleneck is downstream, not compute.
Rejections / 503s rising	A bounded queue is doing its job — shedding load instead of hiding it.
Downstream latency up at the same moment	Strong hint the root cause is a slow dependency holding every worker.
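The signals are most useful in combination. As a rough sketch, a check for the "blocked, not busy" pattern might look like the following. `PoolSnapshot`, its fields, and the 0.5 CPU threshold are all hypothetical, chosen only for illustration; real thresholds depend on the service.

```python
from dataclasses import dataclass

@dataclass
class PoolSnapshot:
    busy_workers: int
    total_workers: int
    queue_depth_prev: int   # queue depth at the previous sample
    queue_depth_now: int    # queue depth at this sample
    cpu_utilization: float  # 0.0 to 1.0

def looks_like_blocked_pool(s, cpu_low=0.5):
    """True when the pool is saturated and backing up while CPU stays low."""
    saturated = s.busy_workers >= s.total_workers
    backing_up = s.queue_depth_now > s.queue_depth_prev
    return saturated and backing_up and s.cpu_utilization < cpu_low

blocked = PoolSnapshot(16, 16, queue_depth_prev=200, queue_depth_now=500, cpu_utilization=0.18)
compute_bound = PoolSnapshot(16, 16, queue_depth_prev=200, queue_depth_now=500, cpu_utilization=0.95)
print(looks_like_blocked_pool(blocked), looks_like_blocked_pool(compute_bound))  # True False
```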

Watch out for

An unbounded queue masks the saturation — latency creeps up while memory fills, and the first hard signal you get is an out-of-memory crash.
No per-task timeout means a slow dependency can hold a worker indefinitely. A call that never returns permanently removes a thread from the pool.
One slow dependency starves all workers. Without a bulkhead to isolate it, an unrelated, healthy endpoint sharing the pool goes down too.
Adding threads doesn't help when you're blocked on downstream — you just fire more concurrent calls at the thing that's already overloaded, often making it worse.
Sizing the pool by CPU count when the work is I/O-bound. A 95% wait / 5% compute task needs far more workers than cores (a common rule of thumb is cores × (1 + wait/compute), about 20 per core here), or, better, non-blocking I/O.
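The bulkhead mentioned above can be as small as a semaphore per dependency. This is a minimal sketch, assuming a cap on concurrent calls is the isolation you want. `Bulkhead` and `BulkheadFull` are hypothetical names, not part of any library.

```python
import threading
from contextlib import contextmanager

class BulkheadFull(Exception):
    """Raised when a dependency already has its maximum number of calls in flight."""

class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: if the dependency is at its cap, fail fast
        # instead of parking yet another worker behind it.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            yield
        finally:
            self._slots.release()

payments = Bulkhead("payments", max_concurrent=2)
results = []
with payments.slot():
    with payments.slot():
        try:
            with payments.slot():  # third concurrent call: over the cap
                results.append("ran")
        except BulkheadFull:
            results.append("rejected")
with payments.slot():              # slots were released, so this one runs
    results.append("ran")
print(results)  # ['rejected', 'ran']
```

With one bulkhead per dependency, a slow provider can occupy at most its own share of the pool, and the remaining workers stay available for healthy endpoints.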

Worked example

A payment endpoint runs on a pool of 16 workers. Its downstream provider normally answers in 50ms, so each worker handles roughly 20 requests/second. At 200 req/s, Little's Law says average in-flight work is 200 × 0.05 = 10 — comfortably under 16 workers. Plenty of headroom.

Then the provider degrades to 3s. Sustaining 200 req/s would now take 200 × 3 = 600 concurrent tasks, but there are only 16 workers. At 200 req/s, 16 slow calls arrive in about 80ms, so all 16 workers are blocked almost immediately, and the pool can complete only 16 ÷ 3 ≈ 5 requests per second. With an unbounded queue the backlog grows by roughly 195 requests every second, p99 latency explodes past the client timeout, and yet CPU sits around 20%: the workers are waiting, not computing. Naively bumping the pool to 64 just fires 4× the concurrent calls at an already-struggling provider.
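The arithmetic in this example is easy to check. The snippet below only restates the numbers from the text. The function and variable names are illustrative, and the saturation time is an estimate that treats every call after the slowdown as a 3s call.

```python
def in_flight(arrival_rate_per_s, latency_s):
    """Little's Law: average in-flight work = arrival rate x time in system."""
    return arrival_rate_per_s * latency_s

WORKERS, RATE = 16, 200.0

healthy = in_flight(RATE, 0.050)      # about 10 in flight: under 16 workers
degraded = in_flight(RATE, 3.0)       # 600 needed: far over 16 workers
pool_ceiling = WORKERS / 3.0          # about 5.3 req/s a fully blocked pool completes
backlog_growth = RATE - pool_ceiling  # about 195 req/s piling into the queue
time_to_saturate = WORKERS / RATE     # 0.08s until 16 slow calls have arrived

print(round(healthy), round(degraded), round(pool_ceiling, 1), round(backlog_growth))
```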

Containment: add an 800ms per-task timeout so a stuck call releases its worker instead of holding it forever; put the queue behind a bounded capacity so excess requests are shed as fast 503s rather than piling into memory; and wrap the provider in a circuit breaker that trips after a burst of timeouts, so workers stop even trying the bad dependency and stay free for healthy traffic. The endpoint now degrades to "some payments rejected, fast" instead of "everything hangs, then OOM" — and the rest of the service stays alive.
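A back-of-the-envelope check shows what the timeout and the bounded queue buy while the provider is still slow and the breaker has not yet tripped. The queue capacity of 32 is a hypothetical value, since the example does not specify one. The wait figure is a rough bound, not a measurement.

```python
WORKERS, RATE, TIMEOUT_S = 16, 200.0, 0.8
QUEUE_CAPACITY = 32  # hypothetical: not given in the example

# Each worker is released after at most 0.8s, so the pool attempts
# about 20 calls per second even when every call times out.
attempt_ceiling = WORKERS / TIMEOUT_S

# Everything above that ceiling is shed as a fast 503 once the queue is full.
shed_rate = max(0.0, RATE - attempt_ceiling)

# A request that takes the last queue slot waits for the tasks ahead of it
# to be picked up: roughly capacity / ceiling seconds.
worst_queue_wait_s = QUEUE_CAPACITY / attempt_ceiling

print(round(attempt_ceiling), round(shed_rate), round(worst_queue_wait_s, 1))
```

The queue capacity is therefore a latency budget as much as a memory limit: a deeper queue rejects fewer requests but makes the accepted ones wait longer. Once the breaker opens, workers skip the call entirely and the queue drains almost immediately.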

Check yourself

Your pool is at 100% utilization, queue depth is climbing, p99 is spiking — but CPU is steady at 18%. What's the most likely cause?

The downstream provider is slow and your pool is exhausted. Which move actually helps contain it?