Code Room
On-call · Medium
Question
Your API tier autoscales beautifully and stays at ~40% CPU through a marketing-driven 3x surge, yet p99 climbs and `pool timeout: could not acquire connection within 5000ms` errors appear at the app. Each pod has an HTTP client pool to a downstream `pricing` service sized at 20 connections — a number chosen long ago to match steady-state concurrency. The downstream `pricing` service itself reports normal latency and is not saturated. Pods are plentiful. How do you triage and mitigate this surge-time failure?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
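One hypothesis worth testing against real signals is that the 20-connection pool is simply too small for per-pod concurrency during the surge, even though CPU looks healthy. A rough way to check is Little's Law: mean in-flight requests equal arrival rate times mean time per request. The sketch below is a back-of-the-envelope calculation, not a sizing tool. The function names are hypothetical, and the request rates and the 150 ms latency are made-up illustration values; only the pool size of 20 comes from the question. Substitute the per-pod numbers from your own dashboards.

```python
POOL_SIZE = 20  # per-pod pool size, taken from the question


def in_flight_demand(per_pod_rps: float, downstream_latency_ms: float) -> float:
    """Little's Law: mean concurrent calls = arrival rate * mean latency."""
    return per_pod_rps * downstream_latency_ms / 1000


def pool_is_exhausted(per_pod_rps: float, downstream_latency_ms: float,
                      pool_size: int = POOL_SIZE) -> bool:
    """True when mean demand alone exceeds the pool, so callers must queue."""
    return in_flight_demand(per_pod_rps, downstream_latency_ms) > pool_size


# Hypothetical numbers for illustration only.
steady = in_flight_demand(per_pod_rps=100, downstream_latency_ms=150)
surge = in_flight_demand(per_pod_rps=300, downstream_latency_ms=150)

print(f"steady-state in-flight: {steady:.0f} of {POOL_SIZE}")
print(f"surge in-flight:        {surge:.0f} of {POOL_SIZE}")
print("pool exhausted at surge:", pool_is_exhausted(300, 150))
```

If measured per-pod demand exceeds the pool size, callers queue for a connection and can hit the 5000 ms acquire timeout while the downstream service still reports normal latency, because time spent waiting in the client pool is invisible to it. If measured demand is comfortably below 20, the hypothesis is ruled out and other causes, such as connections not being returned to the pool, deserve a look.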