Question
At 12:10 your reporting API starts returning 500s under moderate load. Logs are full of `HikariPool-1 - Connection is not available, request timed out after 30000ms`. The Postgres server itself is healthy: CPU is at 30%, only 60 of its `max_connections=200` are in use, and there are no long-running queries on its side. The app pool is configured for 50 connections and all of them are checked out, yet DB-side activity is low, so the connections appear to be held by the app while sitting idle on the database side. A feature shipped this morning added a new 'export full history' endpoint. Throughput on the rest of the API has also degraded. How do you triage and fix this?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
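One possible mitigation, if the signals do point at the new export endpoint holding connections, is a bulkhead that caps how many exports can run at once so they can never occupy the whole 50-connection pool. The sketch below is a minimal illustration under that assumption, not the confirmed fix: `ExportBulkhead`, `tryRun`, and the permit count are hypothetical names and values, and it uses only `java.util.concurrent.Semaphore` from the standard library rather than any HikariCP API.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/**
 * Hypothetical bulkhead: caps how many export requests may run (and so hold
 * a pooled connection) at once, leaving the rest of the pool for other endpoints.
 */
final class ExportBulkhead {
    private final Semaphore permits;

    ExportBulkhead(int maxConcurrentExports) {
        this.permits = new Semaphore(maxConcurrentExports);
    }

    /**
     * Runs the export if a permit is free. Otherwise returns empty immediately,
     * so the caller can reject the request (for example with 429 or 503)
     * instead of queueing on the connection pool for 30 seconds.
     */
    <T> Optional<T> tryRun(Supplier<T> export) {
        if (!permits.tryAcquire()) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(export.get());
        } finally {
            permits.release(); // always return the permit, even if the export throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        ExportBulkhead bulkhead = new ExportBulkhead(1);
        // While the outer export holds the only permit, a second attempt is rejected.
        Optional<String> outer = bulkhead.tryRun(() ->
                bulkhead.tryRun(() -> "inner").isPresent() ? "inner ran" : "inner rejected");
        System.out.println(outer.orElse("outer rejected")); // prints "inner rejected"
        System.out.println(bulkhead.availablePermits());    // prints 1
    }
}
```

A cap like this only contains the blast radius. The root-cause work still has to establish why the endpoint keeps connections checked out (for example, a connection held while streaming a large response, or one that is never closed on some code path) and fix that directly.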