A slow dependency quietly holds every connection hostage until new requests have nowhere to go.
Opening a fresh database connection is expensive, so services keep a small, fixed-size pool of them — say five. A request borrows a connection, does its work, and returns it for the next request to reuse.
That works beautifully while each request finishes fast. But if a downstream query gets slow, every borrowed connection is held longer. New requests queue up waiting for a free one. Once all five are checked out and the wait queue fills, the next caller can't acquire a connection at all — that's pool exhaustion. The cure is to bound the wait and fix the slow call, not to grow the pool to infinity.
The single most important defence is an acquire timeout: a bounded wait for a free connection. If none frees up in, say, 250 ms, the caller fails fast and sheds load instead of hanging forever and piling up threads. Pair that with try/finally (or try-with-resources) so a connection is always returned, even on error.
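```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLTransientConnectionException;
import javax.sql.DataSource;

// A bounded pool: 5 connections, fail fast if none frees up.
// JDBC URL and credentials are omitted for brevity.
HikariConfig cfg = new HikariConfig();
cfg.setMaximumPoolSize(5);
cfg.setConnectionTimeout(250); // acquire timeout: wait at most 250 ms (HikariCP's minimum)
DataSource pool = new HikariDataSource(cfg);

// Other SQLExceptions (query errors, failures in close()) propagate to the caller.
void handle(Request r) throws SQLException {
    // getConnection() blocks up to 250 ms, then THROWS instead of hanging.
    try (Connection c = pool.getConnection()) { // returned automatically
        runQuery(c, r);
    } catch (SQLTransientConnectionException timeout) {
        // No connection was free in time. Shed load: 503, don't queue forever.
        respond(503, "busy, retry shortly");
    }
    // The real cure is making runQuery() fast (or circuit-breaking the
    // slow dependency), NOT setting maxPoolSize to infinity, which just
    // moves the bottleneck onto the database and overloads it.
}
```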
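The same two mechanics, a bounded wait and a guaranteed return, can be tried without a database. The sketch below is a toy model, not how HikariCP is implemented: a `Semaphore` stands in for the pool, each permit stands in for a connection, and the class name `BoundedPoolSketch` and its methods are made up for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy model of a fixed-size pool: each permit stands in for one connection.
class BoundedPoolSketch {
    private final Semaphore permits;
    private final long acquireTimeoutMs;

    BoundedPoolSketch(int size, long acquireTimeoutMs) {
        this.permits = new Semaphore(size, true); // fair: waiters are served in arrival order
        this.acquireTimeoutMs = acquireTimeoutMs;
    }

    /** Runs work while holding a permit; returns false if none freed up in time. */
    boolean tryRun(Runnable work) {
        try {
            if (!permits.tryAcquire(acquireTimeoutMs, TimeUnit.MILLISECONDS)) {
                return false; // acquire timeout: fail fast so the caller can shed load
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        try {
            work.run();
            return true;
        } finally {
            permits.release(); // always returned, even if work throws
        }
    }

    int available() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        BoundedPoolSketch pool = new BoundedPoolSketch(1, 50);
        try {
            pool.tryRun(() -> { throw new IllegalStateException("query failed"); });
        } catch (IllegalStateException expected) {
            // The failure propagates, but the permit has already been released.
        }
        System.out.println("permits available after a failed job: " + pool.available());
    }
}
```

Removing the `finally` block is an easy way to watch the leak described earlier: every failed job would then cost the pool one permit for good.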
| Lever | Effect | Risk |
|---|---|---|
| Tiny pool | Low resource use; protects the database from overload | Easy to exhaust under any latency spike or burst |
| Huge pool | Absorbs slowness so requests rarely wait | Hides the real problem, burns memory, can overload the DB past its max_connections |
| Acquire timeout | Callers fail fast and shed load instead of hanging | Returns errors during incidents; needs caller retry and backoff |
| Circuit breaker on the slow dependency | Stops sending traffic into the slow path, freeing the pool to recover | Drops a feature while open; needs tuning and a half-open probe |
| Fix the slow query | Removes the root cause; restores real throughput | Slowest to ship; the others only buy you time |
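The circuit-breaker row can be made concrete with a small state machine. This is a minimal sketch under assumed parameters (a consecutive-failure threshold and a fixed cool-down), not the API of any particular resilience library; `CircuitBreakerSketch` and its method names are hypothetical, and the clock is injected so the behaviour can be exercised without sleeping.

```java
import java.util.function.LongSupplier;

// Minimal circuit breaker: CLOSED -> OPEN after N consecutive failures,
// OPEN -> HALF_OPEN after a cool-down, then one probe decides the next state.
class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // millisecond clock, injected for testing
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    CircuitBreakerSketch(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True if the caller may hit the dependency now. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // cool-down elapsed: let exactly one probe through
            return true;
        }
        return state == State.CLOSED; // OPEN, or HALF_OPEN with a probe in flight: reject
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // trip, or re-trip after a failed probe
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would check `allowRequest()` before borrowing a connection for the slow path and report the outcome afterwards. While the breaker is open, those requests never touch the pool, which is what gives it room to recover.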
- Raising maxPoolSize to mask a slow query only moves the bottleneck onto the database, which then becomes the thing that falls over.
- If the pool grows past the database's max_connections, you exhaust the database itself instead of your local pool.
- Skipping finally or try-with-resources means a connection is borrowed and never returned, so the pool bleeds dry one request at a time.

Pool of 5. A normal request holds a connection for 20 ms. Each connection can serve 1000 / 20 = 50 requests per second, so the pool's capacity is 5 × 50 = 250 req/s. Comfortable.
Now a downstream call slows to 800 ms. Each connection now serves only 1000 / 800 = 1.25 requests per second, so capacity collapses to 5 × 1.25 = 6.25 req/s, a 40× drop. At 250 req/s of incoming traffic against roughly 6 req/s of capacity, the wait queue fills in milliseconds. p99 latency balloons toward the full hold time, and as soon as waiters have queued for longer than the acquire timeout it starts firing: callers get fast 503s instead of hung threads. Right-sizing the pool wouldn't have saved you here; only making that 800 ms call fast (or breaking the circuit) restores real capacity.
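The arithmetic in this example is easy to rerun with other numbers. The sketch below only restates the capacity formula from the text; `PoolCapacity` and its method names are made up for illustration. The second method is an added back-of-envelope estimate (arrival rate × hold time, in the spirit of Little's law) that assumes steady traffic and ignores what the database itself can sustain.

```java
// Back-of-envelope pool capacity, using the numbers from the worked example.
class PoolCapacity {
    /** Requests per second a pool can serve if each request holds a connection for holdMillis. */
    static double capacityPerSecond(int poolSize, double holdMillis) {
        return poolSize * (1000.0 / holdMillis);
    }

    /** Connections needed to keep up with ratePerSecond at the given hold time. */
    static long connectionsNeeded(double ratePerSecond, double holdMillis) {
        return (long) Math.ceil(ratePerSecond * holdMillis / 1000.0);
    }

    public static void main(String[] args) {
        double healthy = capacityPerSecond(5, 20);   // 250.0 req/s
        double degraded = capacityPerSecond(5, 800); // 6.25 req/s
        System.out.println("healthy capacity (req/s):  " + healthy);
        System.out.println("degraded capacity (req/s): " + degraded);
        System.out.println("drop factor:               " + healthy / degraded);
        System.out.println("connections for 250 req/s at 800 ms: "
                + connectionsNeeded(250, 800));
    }
}
```

Under these assumptions, sustaining 250 req/s at an 800 ms hold time would take about 200 connections, 40 times the original pool, which is why resizing is not a realistic answer to this incident.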
Your pool is exhausted because a downstream call suddenly got slow and is holding every connection. What's the best first move?