Connection pool exhaustion

A slow dependency quietly holds every connection hostage until new requests have nowhere to go.

The idea

Opening a fresh database connection is expensive, so services keep a small, fixed-size pool of them — say five. A request borrows a connection, does its work, and returns it for the next request to reuse.

That works beautifully while each request finishes fast. But if a downstream query gets slow, every borrowed connection is held longer. New requests queue up waiting for a free one. Once all five are checked out and the wait queue fills, the next caller can't acquire a connection at all — that's pool exhaustion. The cure is to bound the wait and fix the slow call, not to grow the pool to infinity.
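
The borrow-and-return cycle can be modeled with nothing but the standard library. The sketch below is a toy, not how a production pool such as HikariCP is built: `ToyPool` and its method names are hypothetical, and plain `Integer` ids stand in for real connections.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy model of a fixed-size pool. Integer ids stand in for real connections.
final class ToyPool {
    private final BlockingQueue<Integer> idle;

    ToyPool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int id = 0; id < size; id++) {
            idle.add(id);
        }
    }

    // Borrow a connection, waiting at most waitMs.
    // Returns null if none frees up in time (or if the thread is interrupted).
    Integer borrow(long waitMs) {
        try {
            return idle.poll(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt flag
            return null;
        }
    }

    // Return a borrowed connection so the next caller can reuse it.
    // Throws IllegalStateException if more ids are returned than were borrowed.
    void giveBack(Integer id) {
        idle.add(id);
    }

    // Number of connections currently free.
    int idleCount() {
        return idle.size();
    }
}
```

With `new ToyPool(5)`, five calls to `borrow` succeed and a sixth returns `null` after its wait expires, which is exhaustion in miniature. Calling `giveBack` on any of the five makes the next `borrow` succeed again.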

How it works

The single most important defence is an acquire timeout: a bounded wait for a free connection. If none frees up in, say, 250 ms, the caller fails fast and sheds load instead of hanging forever and piling up threads. Pair that with try/finally (or try-with-resources) so a connection is always returned, even on error.

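import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLTransientConnectionException;
import javax.sql.DataSource;

// A bounded pool: 5 connections, fail fast if none frees up.
HikariConfig cfg = new HikariConfig();
cfg.setJdbcUrl("jdbc:postgresql://localhost:5432/app");  // placeholder URL, substitute your own
cfg.setMaximumPoolSize(5);
cfg.setConnectionTimeout(250);   // acquire timeout: wait at most 250 ms (HikariCP's minimum)

DataSource pool = new HikariDataSource(cfg);

// Request, runQuery() and respond() are application placeholders.
void handle(Request r) throws SQLException {
  // getConnection() blocks up to 250 ms, then THROWS instead of hanging.
  try (Connection c = pool.getConnection()) {   // returned automatically
    runQuery(c, r);
  } catch (SQLTransientConnectionException timeout) {
    // No connection was free in time. Shed load: 503, don't queue forever.
    respond(503, "busy, retry shortly");
  }
  // The real cure is making runQuery() fast (or circuit-breaking the
  // slow dependency), NOT raising maximumPoolSize without limit, which just
  // moves the bottleneck onto the database and overloads it.
}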

Cost and trade-offs

Lever	Effect	Risk
Tiny pool	Low resource use; protects the database from overload	Easy to exhaust under any latency spike or burst
Huge pool	Absorbs slowness so requests rarely wait	Hides the real problem, burns memory, can overload the DB past its `max_connections`
Acquire timeout	Callers fail fast and shed load instead of hanging	Returns errors during incidents; needs caller retry and backoff
Circuit breaker on the slow dependency	Stops sending traffic into the slow path, freeing the pool to recover	Drops a feature while open; needs tuning and a half-open probe
Fix the slow query	Removes the root cause; restores real throughput	Slowest to ship; the others only buy you time
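
The circuit-breaker row, including its half-open probe, is easier to picture as a small state machine. The sketch below is a hypothetical illustration, not the API of any particular library: `SimpleBreaker`, its method names, and the threshold and cool-down values are all assumptions. The current time is passed in as a parameter so the transitions are easy to follow and test.

```java
// Minimal circuit breaker (illustrative only):
//   CLOSED    -> OPEN       after N consecutive failures
//   OPEN      -> HALF_OPEN  once the cool-down has elapsed (one probe allowed)
//   HALF_OPEN -> CLOSED     if the probe succeeds, back to OPEN if it fails
final class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long coolDownMs;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMs = 0;

    SimpleBreaker(int failureThreshold, long coolDownMs) {
        this.failureThreshold = failureThreshold;
        this.coolDownMs = coolDownMs;
    }

    // Ask before calling the slow dependency. Returns false while the circuit is open.
    synchronized boolean allowRequest(long nowMs) {
        if (state == State.OPEN && nowMs - openedAtMs >= coolDownMs) {
            state = State.HALF_OPEN;   // let exactly one probe through
            return true;
        }
        return state == State.CLOSED;  // HALF_OPEN rejects until the probe reports back
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;      // probe succeeded: resume normal traffic
        }
    }

    synchronized void recordFailure(long nowMs) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;        // trip (or re-trip) and restart the cool-down
            openedAtMs = nowMs;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would check `allowRequest(System.currentTimeMillis())` before borrowing a connection for the slow path, and skip the call (returning a fallback or an error) when it returns false. That keeps pool connections free for requests that do not touch the slow dependency.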

Watch out for

No acquire timeout: requests block forever waiting for a connection, threads pile up, and the whole service hangs instead of failing a few requests cleanly.
Growing maximumPoolSize to mask a slow query: it only moves the bottleneck onto the database, which then becomes the thing that falls over.
Pool larger than the DB allows: if the pool sizes of all app instances add up to more than the database's max_connections, you exhaust the database itself instead of your local pool.
Leaked connections: a missing finally or try-with-resources means a connection is borrowed and never returned — the pool bleeds dry one request at a time.
Long transactions across external calls: holding a connection open while you wait on an HTTP call or a lock keeps it checked out far longer than the query needs.
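
The "pool larger than the DB allows" item comes down to a budget check across the whole fleet. The sketch below shows that arithmetic; `PoolBudget`, its method names, and the idea of reserving some connections for other clients (migrations, admin sessions) are assumptions for illustration, and the numbers in the usage note are made up.

```java
// Budget check: every app instance opens its own pool, so the database
// sees (pool size per instance) x (number of instances) connections.
final class PoolBudget {

    // Largest per-instance pool size that keeps the fleet within the database limit.
    static int maxPoolPerInstance(int dbMaxConnections, int reservedForOthers, int appInstances) {
        if (appInstances <= 0) {
            throw new IllegalArgumentException("appInstances must be positive");
        }
        int available = dbMaxConnections - reservedForOthers;
        return Math.max(0, available / appInstances);
    }

    // True if the whole fleet fits under the database's connection limit.
    static boolean fits(int poolSizePerInstance, int appInstances,
                        int dbMaxConnections, int reservedForOthers) {
        long fleetTotal = (long) poolSizePerInstance * appInstances;
        return fleetTotal <= (long) dbMaxConnections - reservedForOthers;
    }
}
```

For example, with a database limit of 100, 10 connections held back, and 6 app instances, `maxPoolPerInstance(100, 10, 6)` gives 15. A pool of 20 per instance would ask for 120 connections and fail the `fits` check.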

Worked example

Pool of 5. A normal request holds a connection for 20 ms. Each connection can serve 1000 / 20 = 50 requests per second, so the pool's capacity is 5 × 50 = 250 req/s. Comfortable.

Now a downstream call slows to 800 ms. Each connection now serves only 1000 / 800 = 1.25 requests per second, so capacity collapses to 5 × 1.25 ≈ 6 req/s, a 40× drop. With incoming traffic anywhere near the old 250 req/s ceiling, ~6 req/s of capacity means waiters pile up within milliseconds. Latency for the requests that do get a connection balloons toward the acquire wait plus the 800 ms hold time, and every caller that waits longer than the acquire timeout gets a fast 503 instead of a hung thread. Right-sizing the pool wouldn't have saved you here: only making that 800 ms call fast (or breaking the circuit) restores real capacity.
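
The arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch: `PoolCapacity` and its method names are hypothetical, and `connectionsNeeded` applies Little's law (average connections busy = arrival rate × hold time), which the text does not name but which gives the same numbers.

```java
final class PoolCapacity {

    // Max sustainable throughput: each connection serves 1000 / holdMs requests per second.
    static double capacityPerSecond(int poolSize, double holdMs) {
        return poolSize * 1000.0 / holdMs;
    }

    // Little's law: connections busy on average = arrival rate x hold time (in seconds).
    static int connectionsNeeded(double requestsPerSecond, double holdMs) {
        return (int) Math.ceil(requestsPerSecond * holdMs / 1000.0);
    }

    public static void main(String[] args) {
        System.out.println(capacityPerSecond(5, 20));     // prints 250.0
        System.out.println(capacityPerSecond(5, 800));    // prints 6.25
        System.out.println(connectionsNeeded(250, 20));   // prints 5
        System.out.println(connectionsNeeded(250, 800));  // prints 200
    }
}
```

The last line is the reason growing the pool is not a realistic fix: sustaining 250 req/s at an 800 ms hold time would take roughly 200 connections, a 40× larger pool pushed onto the database.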

Check yourself

Your pool is exhausted because a downstream call suddenly got slow and is holding every connection. What's the best first move?