Connection pool exhaustion

A slow dependency quietly holds every connection hostage until new requests have nowhere to go.

The idea

Opening a fresh database connection is expensive, so services keep a small, fixed-size pool of them — say five. A request borrows a connection, does its work, and returns it for the next request to reuse.

That works beautifully while each request finishes fast. But if a downstream query gets slow, every borrowed connection is held longer. New requests queue up waiting for a free one. Once all five are checked out and the wait queue fills, the next caller can't acquire a connection at all — that's pool exhaustion. The cure is to bound the wait and fix the slow call, not to grow the pool to infinity.
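The borrow, return and exhaust cycle described above can be sketched with nothing but the JDK. This is a toy model, not how any real pool is implemented: `TinyPool` and its method names are hypothetical, and connections are stood in for by plain integer ids.

```java
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical toy pool: each "connection" is just an integer id. */
class TinyPool {
    private final BlockingQueue<Integer> idle;

    TinyPool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int id = 0; id < size; id++) {
            idle.add(id);   // every connection starts out idle
        }
    }

    /** Waits at most timeoutMs for a free connection; empty means none freed up in time. */
    Optional<Integer> acquire(long timeoutMs) {
        try {
            return Optional.ofNullable(idle.poll(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt for the caller
            return Optional.empty();
        }
    }

    /** Returns a borrowed connection so the next caller can reuse it. */
    void release(int id) {
        idle.add(id);
    }

    int idleCount() {
        return idle.size();
    }
}
```

Borrowing five times from `new TinyPool(5)` without releasing leaves `idleCount()` at zero, and a sixth `acquire(10)` comes back empty after its bounded wait. That empty result is pool exhaustion as seen by the caller. Releasing any one id makes the next acquire succeed again.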

How it works

The single most important defence is an acquire timeout: a bounded wait for a free connection. If none frees up in, say, 250 ms, the caller fails fast and sheds load instead of hanging forever and piling up threads. Pair that with try/finally (or try-with-resources) so a connection is always returned, even on error.

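// A bounded pool: 5 connections, fail fast if none frees up.
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLTransientConnectionException;
import javax.sql.DataSource;

HikariConfig cfg = new HikariConfig();
// (JDBC URL and credentials omitted; set them for your database.)
cfg.setMaximumPoolSize(5);
cfg.setConnectionTimeout(250);   // acquire timeout: wait at most 250 ms

DataSource pool = new HikariDataSource(cfg);

// Request, runQuery() and respond() stand in for your application's own code.
void handle(Request r) throws SQLException {
  // getConnection() blocks up to 250 ms, then THROWS instead of hanging.
  try (Connection c = pool.getConnection()) {   // returned automatically
    runQuery(c, r);
  } catch (SQLTransientConnectionException timeout) {
    // No connection was free in time. Shed load: 503, don't queue forever.
    respond(503, "busy, retry shortly");
  }
  // Other SQLExceptions (query failures) propagate to the caller.
  // The real cure is making runQuery() fast (or circuit-breaking the
  // slow dependency), NOT setting maximumPoolSize to infinity, which just
  // moves the bottleneck onto the database and overloads it.
}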

Cost and trade-offs

| Lever | Effect | Risk |
| --- | --- | --- |
| Tiny pool | Low resource use; protects the database from overload | Easy to exhaust under any latency spike or burst |
| Huge pool | Absorbs slowness so requests rarely wait | Hides the real problem, burns memory, can overload the DB past its max_connections |
| Acquire timeout | Callers fail fast and shed load instead of hanging | Returns errors during incidents; needs caller retry and backoff |
| Circuit breaker on the slow dependency | Stops sending traffic into the slow path, freeing the pool to recover | Drops a feature while open; needs tuning and a half-open probe |
| Fix the slow query | Removes the root cause; restores real throughput | Slowest to ship; the others only buy you time |
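To make the circuit-breaker row concrete, here is a minimal sketch of the closed, open and half-open cycle. It is an illustration under stated assumptions, not a production breaker: `SimpleBreaker`, its thresholds and the injectable clock are all hypothetical, and what counts as a "failure" (an error, a timeout, a call slower than some limit) is left to the caller.

```java
import java.util.function.LongSupplier;

/** Hypothetical minimal circuit breaker; names and thresholds are illustrative. */
class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // cool-down before a probe is allowed
    private final LongSupplier clock;     // injectable so tests need not sleep
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True if the caller may send a request to the dependency right now. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;   // cool-down elapsed: let exactly one probe through
            return true;
        }
        return state == State.CLOSED;  // OPEN, or HALF_OPEN with a probe in flight: reject
    }

    /** Call after a request to the dependency succeeds. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;      // the probe succeeded: resume normal traffic
        }
    }

    /** Call after a request to the dependency fails or is too slow. */
    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;        // trip, or re-trip after a failed probe
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would check `allowRequest()` before borrowing a connection for the slow path and skip the call (serving a fallback or an error) when it returns false, so those requests never hold a connection. In real code the clock would typically be `System::currentTimeMillis`.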

Worked example

Pool of 5. A normal request holds a connection for 20 ms. Each connection can serve 1000 / 20 = 50 requests per second, so the pool's capacity is 5 × 50 = 250 req/s. Comfortable.

Now a downstream call slows to 800 ms. Each connection serves only 1000 / 800 = 1.25 requests per second, so capacity collapses to 5 × 1.25 = 6.25 req/s, a 40× drop. If incoming traffic is anywhere near the 250 req/s the pool used to handle, waiters pile up within milliseconds. Latency for the requests that do get a connection climbs to the acquire wait plus the 800 ms hold, and every caller that waits longer than the acquire timeout gets a fast 503 instead of a hung thread. Resizing the pool wouldn't have saved you here: by Little's law, sustaining 250 req/s at 800 ms per request needs 250 × 0.8 = 200 connections in use at once, which just moves the overload onto the database. Only making that 800 ms call fast (or breaking the circuit) restores real capacity.
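The arithmetic in this example fits in two small helpers, which make it easy to plug in your own pool size and hold times. `PoolMath` and its method names are hypothetical. The model also assumes every request holds a connection for the same duration, which real traffic only approximates.

```java
/** Hypothetical helpers for back-of-the-envelope pool sizing. */
class PoolMath {
    /** Requests per second the pool can complete when each request holds a connection for holdMs. */
    static double capacityPerSecond(int poolSize, double holdMs) {
        return poolSize * (1000.0 / holdMs);
    }

    /** Little's law: average connections in use = arrival rate x hold time. */
    static double connectionsNeeded(double requestsPerSecond, double holdMs) {
        return requestsPerSecond * holdMs / 1000.0;
    }
}
```

With the numbers above, `capacityPerSecond(5, 20)` gives 250 req/s and `capacityPerSecond(5, 800)` gives 6.25 req/s, while `connectionsNeeded(250, 800)` comes to 200 connections. That last figure is why growing the pool is not a realistic answer to an 800 ms hold time.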

Check yourself

Your pool is exhausted because a downstream call suddenly got slow and is holding every connection. What's the best first move?