Code Room
On-callMedium
Question
To reduce DB connection pressure, an SRE lowers the per-pod max DB pool size from 20 to 5 via a dynamic config change that propagates fleet-wide over ~2 minutes at 14:30 (no deploy, no image change). Immediately the API's p99 latency on write-heavy endpoints triples and a `pool checkout timeout` error rate appears, but the DB's own metrics show it healthy and now UNDER-utilized (fewer active connections than before). The pods have spare CPU. Triage, explain the mechanism, then mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.