When every worker is stuck waiting on something slow, the queue keeps growing and new work just waits — even though the CPU is barely doing anything.
A service usually serves requests from a fixed pool of worker threads. Each worker takes a task off a queue, runs it, then comes back for the next one. That works beautifully — until each task has to block on something slow.
If a downstream dependency slows down, every worker that calls it gets stuck holding its task, waiting. Once all the workers are blocked, the queue stops draining and starts backing up, and new requests wait behind a wall of stuck workers. The dangerous part: the CPU may be nearly idle the whole time. The workers aren't busy computing — they're blocked, and there's no thread left to start the next thing.
The fix isn't more threads — it's bounding the system so it fails fast and stays predictable instead of melting down silently.
// A bounded pool that degrades gracefully instead of exhausting.
pool    = WorkerPool(size = N)            // fixed worker count
queue   = BoundedQueue(capacity = Q)      // NOT unbounded
breaker = CircuitBreaker(downstream)      // trips on too many slow/failed calls

function submit(task):
    if not queue.offer(task):             // queue is full -> shed load now
        reject(task, "overloaded")        // fail fast, count it, return 503
        return

function worker_loop():
    while true:
        task = queue.take()
        if breaker.is_open():             // downstream known-bad: skip the call
            fail_fast(task)               // don't park a worker waiting on it
            continue
        try:
            // per-task timeout: a slow dependency can't hold a worker forever
            result = call_downstream(task, timeout = 800ms)
            breaker.record_success()
            complete(task, result)
        catch Timeout:                    // the timeout frees the worker for the next task
            breaker.record_failure()      // count it toward tripping the breaker
            fail_fast(task)

// Contrast:
//   unbounded queue + no timeout       -> problem HIDDEN until OOM / total stall
//   bounded queue + timeout + breaker  -> problem VISIBLE as rejections,
//                                         workers stay free, core stays alive
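The pseudocode above can be made concrete. What follows is a minimal Python sketch, not a production pool: `BoundedPool` and `fake_downstream` are hypothetical names, and the dependency is simulated (it decides up front whether a call would exceed its timeout rather than actually sleeping), so the behavior is deterministic. The circuit breaker is left out to keep the sketch short.

```python
import queue
import threading


class BoundedPool:
    """Fixed worker count, bounded queue, per-task timeout (illustrative only)."""

    def __init__(self, workers, capacity, call_downstream, timeout_s):
        # Bounded queue: put_nowait() raises queue.Full once `capacity` is reached.
        self.tasks = queue.Queue(maxsize=capacity)
        self.call_downstream = call_downstream
        self.timeout_s = timeout_s
        self.results = {}   # task -> "ok" or "timeout"
        self.rejected = 0   # tasks shed because the queue was full
        self._lock = threading.Lock()
        self._threads = [
            threading.Thread(target=self._worker_loop, daemon=True)
            for _ in range(workers)
        ]

    def start(self):
        for thread in self._threads:
            thread.start()

    def submit(self, task):
        """Return True if queued, False if shed because the queue is full."""
        try:
            self.tasks.put_nowait(task)
            return True
        except queue.Full:
            with self._lock:
                self.rejected += 1  # fail fast; a server would answer 503 here
            return False

    def _worker_loop(self):
        while True:
            task = self.tasks.get()
            try:
                self.call_downstream(task, self.timeout_s)
                outcome = "ok"
            except TimeoutError:
                outcome = "timeout"  # the worker is released instead of parked
            with self._lock:
                self.results[task] = outcome
            self.tasks.task_done()

    def drain(self):
        """Block until every queued task has been processed."""
        self.tasks.join()


def fake_downstream(task, timeout_s):
    """Simulated dependency: odd-numbered tasks would take 3 s, even ones 50 ms."""
    latency_s = 3.0 if task % 2 else 0.05
    if latency_s > timeout_s:
        raise TimeoutError(f"task {task} exceeded {timeout_s}s")
    return "done"
```

Submitting six tasks to a pool with `capacity=4` before the workers start accepts the first four and rejects the last two. After `start()` and `drain()`, the odd-numbered tasks are recorded as timeouts and the even-numbered ones as successes, and no worker was ever held longer than the timeout allows.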
| Symptom | What it's telling you |
|---|---|
| Pool utilization at 100%, all workers busy | No free worker to pick up new work — you're saturated, not just loaded. |
| Queue depth climbing and not draining | Tasks arrive faster than workers finish them. By Little's Law (in-flight = arrival rate × latency), demand exceeds the pool, so the backlog keeps growing. |
| Latency p99 spiking while CPU stays low | Workers are blocked, not busy. The bottleneck is downstream, not compute. |
| Rejections / 503s rising | A bounded queue is doing its job — shedding load instead of hiding it. |
| Downstream latency up at the same moment | Strong hint the root cause is a slow dependency holding every worker. |
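The table reads naturally as a decision rule. The sketch below encodes it as a small triage function, purely as an illustration: the `Snapshot` fields, the `diagnose` name, and every threshold (99% utilization, 50% CPU, 3× baseline latency) are assumptions chosen for the example, not values from any monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    pool_utilization: float            # 0.0-1.0, fraction of workers busy
    queue_depth_trend: float           # tasks/second; positive means growing
    cpu_utilization: float             # 0.0-1.0
    downstream_p99_ms: float           # current downstream latency
    downstream_baseline_p99_ms: float  # what "normal" looks like


def diagnose(s: Snapshot) -> str:
    """Map the symptoms in the table to a likely cause (hypothetical cutoffs)."""
    saturated = s.pool_utilization >= 0.99
    backing_up = s.queue_depth_trend > 0
    cpu_idle = s.cpu_utilization < 0.5
    downstream_slow = s.downstream_p99_ms > 3 * s.downstream_baseline_p99_ms

    if saturated and backing_up and cpu_idle:
        # Workers are blocked, not busy.
        if downstream_slow:
            return "workers blocked on slow downstream"
        return "workers blocked, cause unknown"
    if saturated and backing_up:
        return "compute-bound overload"
    return "healthy or merely loaded"
```

A snapshot with a full pool, a growing queue, 18% CPU, and downstream p99 at 3000 ms against a 50 ms baseline comes back as `"workers blocked on slow downstream"`. The same pool at 95% CPU is classified as compute-bound instead.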
A payment endpoint runs on a pool of 16 workers. Its downstream provider normally answers in 50ms, so each worker handles roughly 20 requests/second. At 200 req/s, Little's Law says average in-flight work is 200 × 0.05 = 10 — comfortably under 16 workers. Plenty of headroom.
Then the provider degrades to 3s. Sustaining 200 req/s would now take 200 × 3 = 600 concurrent calls, but there are only 16 workers. All 16 are blocked on the slow call almost immediately (16 arrivals take under a tenth of a second at 200 req/s), and together they can complete only 16 / 3 ≈ 5 requests per second. The queue grows without bound, p99 latency blows past the client timeout, and yet CPU sits around 20%: the workers are waiting, not computing. Naively bumping the pool to 64 still falls far short of 600 and just fires 4× the concurrent calls at an already-struggling provider.
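The arithmetic in this example is easy to check. The sketch below applies Little's Law (L = λW) to the two latencies from the scenario; the function names are made up for illustration, and latency is kept in milliseconds so the numbers stay exact.

```python
def required_concurrency(arrival_rate_per_s: float, latency_ms: float) -> float:
    """Little's Law: L = lambda * W, with W converted from ms to seconds."""
    return arrival_rate_per_s * latency_ms / 1000


def max_throughput(workers: int, latency_ms: float) -> float:
    """Requests/second the pool can complete when every call takes latency_ms."""
    return workers * 1000 / latency_ms


WORKERS = 16
ARRIVAL_RATE = 200  # req/s

for label, latency_ms in [("healthy", 50), ("degraded", 3000)]:
    need = required_concurrency(ARRIVAL_RATE, latency_ms)
    cap = max_throughput(WORKERS, latency_ms)
    backlog_growth = max(0.0, ARRIVAL_RATE - cap)  # req/s piling into the queue
    print(f"{label}: need {need:.0f} in flight, pool completes {cap:.1f} req/s, "
          f"queue grows {backlog_growth:.1f} req/s")
```

In the healthy case the pool needs 10 calls in flight and could complete 320 req/s, so the queue does not grow. In the degraded case it needs 600 in flight while 16 workers complete only about 5.3 req/s, so the backlog grows by roughly 195 requests every second.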
Containment takes three changes. Add an 800ms per-task timeout so a stuck call releases its worker instead of holding it for the full 3s or longer. Cap the queue at a bounded capacity so excess requests are shed as fast 503s rather than piling up in memory. Wrap the provider in a circuit breaker that trips after a burst of timeouts, so workers stop calling the bad dependency at all and stay free for healthy traffic. The endpoint now degrades to "some payments rejected, fast" instead of "everything hangs, then OOM", and the rest of the service stays alive.
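The circuit breaker is the least familiar of the three pieces, so here is one possible shape for it. This is a deliberately simple sketch under stated assumptions: it trips on consecutive failures, reopens to traffic after a fixed cool-down, and has no separate half-open probing state. The class, its parameters, and the default values are hypothetical, and the injectable `clock` exists only so the behavior can be exercised without waiting.

```python
import threading
import time


class CircuitBreaker:
    """Trips after N consecutive failures; lets calls through again after a cool-down."""

    def __init__(self, failure_threshold=5, cooldown_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the breaker is closed
        self._lock = threading.Lock()

    def is_open(self):
        with self._lock:
            if self.opened_at is None:
                return False
            if self.clock() - self.opened_at >= self.cooldown_s:
                # Cool-down elapsed: close the breaker and start counting afresh.
                self.opened_at = None
                self.consecutive_failures = 0
                return False
            return True

    def record_success(self):
        with self._lock:
            self.consecutive_failures = 0

    def record_failure(self):
        with self._lock:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip: callers should fail fast
```

A worker would check `is_open()` before calling the provider and fail the task immediately when it returns `True`. Real breakers usually admit only a few trial calls after the cool-down; this sketch simply closes fully, which is easier to follow but lets a full burst through if the provider is still unhealthy.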
Your pool is at 100% utilization, queue depth is climbing, p99 is spiking — but CPU is steady at 18%. What's the most likely cause?
The downstream provider is slow and your pool is exhausted. Which move actually helps contain it?