Code Room
On-call · Medium · oc-g089
Subject: Capacity incidents · Level: Mid–Senior · ~30 min · Common in: Databases & SQL · Reliability & on-call interviews · Industries: Technology

Question

It's 09:14 and your payments API p99 latency jumped from 80ms to 6s, with a rising trickle of 500s. The app tier dashboards show CPU at 30% and memory flat — nothing looks saturated there. The Postgres dashboard shows active connections pinned at the 200 max for the primary, a deep `ClientRead`/`Lock` wait spike, and `pg_stat_activity` full of sessions in `idle in transaction`. A deploy went out at 09:05 that added a new 'order summary' endpoint. Walk me through how you triage this and what you do in the next 10 minutes versus the next week.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
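
As one way to turn "form hypotheses from real signals" into something concrete, here is a minimal Python sketch that summarizes a snapshot of `pg_stat_activity`. The column names (`pid`, `state`, `wait_event_type`, `wait_event`, `xact_start`, `query`, `backend_type`) and `pg_terminate_backend` are standard Postgres. Everything else is an assumption made for illustration: the helper names, the 30-second threshold, and the sample rows and query text are hypothetical and do not come from the incident above.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Snapshot query to run on the primary (standard pg_stat_activity columns).
SNAPSHOT_SQL = """
SELECT pid, application_name, state, wait_event_type, wait_event,
       xact_start, left(query, 80) AS query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
"""


def summarize(rows, now, max_open=timedelta(seconds=30)):
    """Count sessions per state and flag long-open 'idle in transaction' ones.

    rows: list of dicts shaped like the SNAPSHOT_SQL result.
    max_open: hypothetical threshold; tune it to your own workload.
    """
    by_state = Counter(r["state"] for r in rows)
    stuck = [
        r for r in rows
        if r["state"] == "idle in transaction"
        and r["xact_start"] is not None
        and now - r["xact_start"] > max_open
    ]
    # The last statement shared by stuck sessions hints at the code path
    # that opened a transaction and never committed or rolled back.
    by_query = Counter(r["query"] for r in stuck)
    return by_state, stuck, by_query


def terminate_statements(stuck):
    """Build pg_terminate_backend calls for review; nothing is executed here."""
    return [f"SELECT pg_terminate_backend({int(r['pid'])});" for r in stuck]


if __name__ == "__main__":
    # Hypothetical snapshot for illustration only.
    now = datetime(2024, 1, 1, 9, 14, tzinfo=timezone.utc)
    rows = [
        {"pid": 101, "state": "idle in transaction",
         "xact_start": now - timedelta(minutes=8),
         "query": "SELECT * FROM orders WHERE customer_id = $1"},
        {"pid": 102, "state": "idle in transaction",
         "xact_start": now - timedelta(minutes=7),
         "query": "SELECT * FROM orders WHERE customer_id = $1"},
        {"pid": 103, "state": "active",
         "xact_start": now - timedelta(seconds=1),
         "query": "UPDATE payments SET status = $1 WHERE id = $2"},
        {"pid": 104, "state": "idle", "xact_start": None, "query": "COMMIT"},
    ]
    by_state, stuck, by_query = summarize(rows, now)
    print(by_state.most_common())
    print(by_query.most_common(1))
    print("\n".join(terminate_statements(stuck)))
```

In the first 10 minutes, a summary like this supports the mitigation call: if the stuck sessions cluster on one statement that arrived with the 09:05 deploy, rolling back or disabling the new endpoint is likely the safer first move, with terminating backends as a fallback to free connections. For the following week, Postgres's `idle_in_transaction_session_timeout` setting is one possible guardrail worth discussing, alongside fixing whatever holds the transaction open.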
