On-call | Medium | oc-g477

Subject: Config change | Level: Mid–Senior | ~35 min | Common in Reliability & on-call interviews | Industries: Technology, Software development

Question

To reduce DB connection pressure, an SRE lowers the per-pod max DB pool size from 20 to 5 at 14:30 via a dynamic config change that propagates fleet-wide over ~2 minutes (no deploy, no image change). Immediately, p99 latency on the API's write-heavy endpoints triples and `pool checkout timeout` errors start appearing, yet the DB's own metrics show it healthy and now under-utilized (fewer active connections than before). The pods have spare CPU. Triage, explain the mechanism, then mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
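
One way to make the mechanism concrete is a rough Little's law estimate: the average number of connections a pod needs is its request rate times the time each request holds a connection. The sketch below is illustrative only. The per-pod rate (100 req/s) and hold time (80 ms) are assumed numbers, not values from the question, and the function names are hypothetical.

```python
def connections_in_use(requests_per_sec: float, hold_time_sec: float) -> float:
    """Little's law: average busy connections = arrival rate x hold time."""
    return requests_per_sec * hold_time_sec


def max_throughput(pool_size: int, hold_time_sec: float) -> float:
    """Upper bound on requests/sec a pod can serve when every connection is busy."""
    return pool_size / hold_time_sec


def describe(pool_size: int, requests_per_sec: float, hold_time_sec: float) -> str:
    """Summarize whether the pool can absorb the offered load."""
    demand = connections_in_use(requests_per_sec, hold_time_sec)
    if demand < pool_size:
        return f"pool={pool_size}: ~{demand:.1f} connections busy, no checkout queue"
    ceiling = max_throughput(pool_size, hold_time_sec)
    return (
        f"pool={pool_size}: demand {demand:.1f} >= pool, throughput capped at "
        f"~{ceiling:.1f} req/s, checkouts queue and time out"
    )


if __name__ == "__main__":
    # Assumed per-pod load: 100 req/s, each holding a connection for 80 ms.
    for size in (20, 5):
        print(describe(size, requests_per_sec=100.0, hold_time_sec=0.08))
```

Under these assumed numbers a pod needs about 8 connections, so a pool of 20 has headroom and a pool of 5 does not. Requests then wait inside the application for a free connection, which costs latency but not CPU, and the DB sees at most 5 active connections per pod. That would be consistent with a healthy, under-utilized database sitting behind a timing-out API, and it points to reverting the pool size as the first mitigation.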
