Question
Black Friday traffic ramps up and your stateless API tier autoscales smoothly from 30 to 90 pods. CPU stays comfortable, yet p99 latency climbs and a growing share of requests time out. Each pod holds a fixed DB connection pool of 10. Postgres `max_connections` is 400, and the `pg_stat_activity` count is pinned at 400, with many sessions in the `idle in transaction` state or waiting on a lock. App dashboards show request latency dominated by 'time waiting to acquire a DB connection from the pool.' Adding more pods made it worse, not better. How do you triage, and what do you change?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
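
As a starting point for the hypothesis step, the numbers in the question can be checked with a quick back-of-the-envelope calculation. The sketch below assumes every pod eventually opens its full pool of 10 and that nothing else connects to the database. The helper names and the `reserved` parameter are hypothetical, introduced only for illustration. In a real cluster `reserved` would cover slots such as Postgres's `superuser_reserved_connections` and any admin or migration sessions.

```python
def connection_demand(pods: int, pool_size: int) -> int:
    """Connections the app tier asks for if every pod fills its pool."""
    return pods * pool_size


def max_pods_within_budget(max_connections: int, pool_size: int, reserved: int = 0) -> int:
    """Largest pod count whose full pools still fit under the server limit."""
    return (max_connections - reserved) // pool_size


def pool_size_for(pods: int, max_connections: int, reserved: int = 0) -> int:
    """Largest per-pod pool that fits if the pod count is held fixed."""
    return (max_connections - reserved) // pods


# Figures taken from the question: pool of 10 per pod, max_connections = 400.
demand_before = connection_demand(pods=30, pool_size=10)
demand_after = connection_demand(pods=90, pool_size=10)
pod_ceiling = max_pods_within_budget(max_connections=400, pool_size=10)
per_pod_at_90 = pool_size_for(pods=90, max_connections=400)

print(f"demand at 30 pods: {demand_before}")      # 300, under the limit of 400
print(f"demand at 90 pods: {demand_after}")       # 900, more than double the limit
print(f"pod ceiling at pool=10: {pod_ceiling}")   # 40
print(f"pool size that fits 90 pods: {per_pod_at_90}")  # 4
```

Under these assumptions the tier crossed the database's connection budget at about 40 pods, which is consistent with more pods making things worse: each new pod competes for the same 400 slots without adding any database capacity. The calculation does not account for the idle-in-transaction and lock-waiting sessions, which hold slots without doing useful work and need their own investigation.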