On-call · Medium · oc-g513

Subject: Capacity incidents
Level: Mid–Senior
Time: ~25 min
Common in: Reliability & on-call interviews
Industries: Technology

Question

Your API tier autoscales beautifully and stays at ~40% CPU through a marketing-driven 3x surge, yet p99 climbs and `pool timeout: could not acquire connection within 5000ms` errors appear at the app. Each pod has an HTTP client pool to a downstream `pricing` service sized at 20 connections — a number chosen long ago to match steady-state concurrency. The downstream `pricing` service itself reports normal latency and is not saturated. Pods are plentiful. How do you triage and mitigate this surge-time failure?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
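One way to turn "form hypotheses from real signals" into something checkable is a quick Little's Law estimate: connections in use per pod is roughly the per-pod call rate multiplied by how long each call holds a connection. The sketch below is illustrative only. The function names, the request rates, the 125 ms hold time, and the 1.5x headroom factor are all assumptions, not values from the scenario; only the pool size of 20 comes from the question. Note that hold time here means the time the client keeps the connection checked out, which can exceed the latency that `pricing` reports for itself.

```python
import math


def in_flight_connections(per_pod_rps: float, calls_per_request: float,
                          hold_time_s: float) -> float:
    """Little's Law: average concurrency = arrival rate x time in system."""
    return per_pod_rps * calls_per_request * hold_time_s


def pool_utilization(per_pod_rps: float, calls_per_request: float,
                     hold_time_s: float, pool_size: int) -> float:
    """Fraction of the pool needed on average; above 1.0 means callers queue."""
    return in_flight_connections(per_pod_rps, calls_per_request, hold_time_s) / pool_size


def required_pool_size(per_pod_rps: float, calls_per_request: float,
                       hold_time_s: float, headroom: float = 1.5) -> int:
    """Average concurrency padded by a headroom factor (assumed, tune to taste)."""
    return math.ceil(
        in_flight_connections(per_pod_rps, calls_per_request, hold_time_s) * headroom
    )


if __name__ == "__main__":
    POOL_SIZE = 20        # from the scenario
    HOLD_TIME_S = 0.125   # hypothetical: time a connection stays checked out

    # Hypothetical per-pod rates: steady state vs. a pod that absorbs more
    # traffic during the surge (for example, if scaling tracks CPU, not concurrency).
    for label, rps in [("steady", 100), ("surge", 200)]:
        util = pool_utilization(rps, 1, HOLD_TIME_S, POOL_SIZE)
        need = required_pool_size(rps, 1, HOLD_TIME_S)
        print(f"{label}: utilization={util:.3f}, suggested pool size={need}")
```

With these made-up numbers the steady-state pod needs about 12.5 connections and fits comfortably in 20, while the surge pod needs about 25 and callers start waiting on the pool even though CPU and the downstream both look healthy. In a real incident the inputs would come from per-pod request rate and client-side call duration metrics, and any pool increase should be checked against the connection limits of `pricing` across all pods.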

