On-call | Medium | oc-g553

Subject: On call | Level: Mid–Senior | ~35 min | Common in: Reliability & on-call interviews | Industries: Technology

Question

At 12:10 your reporting API starts returning 500s under moderate load. Logs are full of `HikariPool-1 - Connection is not available, request timed out after 30000ms`. The Postgres server itself is healthy: CPU is at 30%, only 60 of its `max_connections=200` are in use, and there are no long-running queries on its side. The app pool is configured for 50 connections and all of them are checked out, yet DB-side activity is low, so the connections appear to be held by the app while sitting idle on the database side. A feature shipped this morning added a new 'export full history' endpoint. Throughput on the rest of the API has also degraded. How do you triage and fix this?
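One way to reason about why a healthy database can sit behind an exhausted pool is Little's law: the average number of connections checked out equals the request rate times the time each request holds a connection. The sketch below is only an illustration of that arithmetic. The pool size of 50 comes from the scenario; every traffic figure and the `connections_held` helper are hypothetical.

```python
# Little's law: average connections held = arrival rate * average hold time.
# All traffic numbers below are hypothetical, chosen only to show the shape
# of the problem; only POOL_SIZE comes from the scenario.

POOL_SIZE = 50


def connections_held(requests_per_second: float, hold_seconds: float) -> float:
    """Average number of pool connections checked out at steady state."""
    return requests_per_second * hold_seconds


# Normal reporting traffic: many requests, each holding a connection briefly.
baseline = connections_held(requests_per_second=200, hold_seconds=0.05)

# Hypothetical export endpoint: few requests, each holding a connection for
# minutes while the app serializes rows (so little work shows up DB-side).
export = connections_held(requests_per_second=0.5, hold_seconds=120)

total = baseline + export
print(f"baseline={baseline:.0f} export={export:.0f} total={total:.0f} pool={POOL_SIZE}")
print("pool saturated" if total >= POOL_SIZE else "pool has headroom")
```

Under these assumed numbers, a low-rate endpoint with a long hold time demands more connections than all other traffic combined, which would match the symptom of every endpoint timing out while Postgres stays quiet.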

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
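As one possible mitigation (an assumption, not the only valid answer; disabling or rolling back the new endpoint is usually faster), the export path can be given a concurrency cap so it can never hold more than a small share of the pool. The sketch below models that bulkhead pattern in Python. The `Bulkhead` class, the `handle_export` handler, and the cap of 5 are hypothetical names and values for illustration.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when the export endpoint is already at its concurrency cap."""


class Bulkhead:
    """Caps how many callers may run a guarded section at the same time."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Fail fast instead of queueing, so waiting exports do not pile up.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("export concurrency limit reached")
        try:
            yield
        finally:
            # Always release, even if the export raises part-way through.
            self._slots.release()


# Hypothetical cap: far below the pool size of 50 from the scenario.
EXPORT_BULKHEAD = Bulkhead(max_concurrent=5)


def handle_export(run_export):
    """Hypothetical handler: returns (status_code, body)."""
    try:
        with EXPORT_BULKHEAD.slot():
            return 200, run_export()
    except BulkheadFull:
        return 503, "export busy, retry later"
```

Rejecting excess exports with a 503 keeps the rest of the API responsive while the root cause (for example, a connection held open during slow serialization, or one never returned to the pool) is investigated.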

