On-call | Hard | oc-g313

Subject: Capacity incidents | Level: Senior–Staff | ~30 min | Common in: Reliability & on-call interviews | Industries: Technology

Question

Black Friday traffic ramps up and your stateless API tier autoscales smoothly from 30 to 90 pods. CPU stays comfortable, yet p99 latency climbs and a growing share of requests time out. Each pod holds a fixed DB connection pool of 10. Postgres `max_connections` is 400 and the `pg_stat_activity` count is pinned right at 400, with many sessions idle in transaction or waiting on a lock. App dashboards show request latency dominated by "time waiting to acquire a DB connection from the pool." Adding more pods made it worse, not better. How do you triage, and what do you change?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
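One hypothesis worth checking against the numbers in the question is plain connection arithmetic. The sketch below is illustrative only: the helper names and the `reserved` headroom of 40 connections (for admin sessions, migrations, and the like) are assumptions, not values from the scenario. It also assumes every pod fills its pool under load.

```python
import math

MAX_CONNECTIONS = 400   # Postgres max_connections, from the question
POOL_SIZE = 10          # fixed per-pod pool size, from the question


def pool_demand(pods: int, pool_size: int) -> int:
    """Connections requested if every pod fills its pool."""
    return pods * pool_size


def saturation_pods(max_connections: int, pool_size: int) -> int:
    """Smallest pod count whose combined pools reach the connection limit."""
    return math.ceil(max_connections / pool_size)


def per_pod_pool_cap(max_connections: int, max_pods: int, reserved: int) -> int:
    """Largest per-pod pool that fits max_pods under the limit.

    `reserved` is a hypothetical headroom for non-application sessions.
    """
    return (max_connections - reserved) // max_pods


for pods in (30, 90):
    demand = pool_demand(pods, POOL_SIZE)
    print(f"{pods} pods -> {demand} connections "
          f"({demand - MAX_CONNECTIONS:+d} vs. limit)")

print("limit reached at", saturation_pods(MAX_CONNECTIONS, POOL_SIZE), "pods")
print("pool cap at 90 pods:", per_pod_pool_cap(MAX_CONNECTIONS, 90, reserved=40))
```

Under these assumptions, 30 pods ask for 300 connections and fit, while 90 pods ask for 900. The limit is reached at 40 pods, so every pod beyond that only adds contention for the same 400 slots, which is consistent with scaling out making latency worse. Fitting 90 pods under the limit would mean a pool of about 4 per pod, which may be too small; that is one reason a shared connection pooler in front of Postgres is often considered instead.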
