On-call · Hard · oc-g225

Subject: Connection pool exhaustion
Level: Mid–Senior
~35 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

Aurora MySQL. A 20-second primary failover happened at 14:00. The DB recovered, but for the next 8 minutes the new primary sat at `Threads_connected` = `max_connections` with CPU pegged, and the app threw "Too many connections" errors even though real query volume was normal. The app tier is ~200 stateless pods, each with a 20-connection HikariCP pool and aggressive reconnect-on-error. Connections that *do* get in run fast. Walk through what happened, how you'd recover now, and how you'd prevent a repeat.
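A quick way to size the problem is to compare what the app tier can ask for against what the database will accept. The sketch below uses the pod count and pool size from the question. The `max_connections` value and the reserved headroom are assumptions made up for illustration, since the question does not state them.

```python
def connection_demand(pods: int, pool_size: int) -> int:
    """Upper bound on connections the app tier can try to hold at once."""
    return pods * pool_size


def oversubscription(pods: int, pool_size: int, max_connections: int) -> float:
    """Ratio of worst-case demand to the database connection limit."""
    return connection_demand(pods, pool_size) / max_connections


def max_pool_size_per_pod(pods: int, max_connections: int, reserved: int) -> int:
    """Largest per-pod pool that still fits under the limit,
    leaving `reserved` connections for admin and monitoring sessions."""
    return (max_connections - reserved) // pods


PODS = 200        # from the question
POOL_SIZE = 20    # from the question
ASSUMED_MAX_CONNECTIONS = 2000  # hypothetical; not given in the question
ASSUMED_RESERVED = 100          # hypothetical headroom for operators

print(connection_demand(PODS, POOL_SIZE))                           # 4000
print(oversubscription(PODS, POOL_SIZE, ASSUMED_MAX_CONNECTIONS))   # 2.0
print(max_pool_size_per_pod(PODS, ASSUMED_MAX_CONNECTIONS, ASSUMED_RESERVED))  # 9
```

Under these assumed numbers, the pools can collectively request twice what the database allows. That is consistent with the limit being hit after a failover, when every pod tries to refill its pool at the same moment.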

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
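One prevention a candidate might raise is replacing aggressive reconnect-on-error with capped exponential backoff plus jitter, so that pods do not all retry in lockstep. The following is a minimal sketch of that idea, not HikariCP's API. The function names, the `base` and `cap` defaults, and the use of `ConnectionError` are all assumptions chosen for illustration.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def connect_with_backoff(connect, max_attempts: int = 6,
                         base: float = 0.1, cap: float = 10.0,
                         sleep=time.sleep):
    """Call `connect()` until it succeeds, sleeping a jittered delay
    between failures. Re-raises the last error once attempts run out.

    `connect` is any zero-argument callable that returns a connection
    or raises ConnectionError (a stand-in for the driver's real error).
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Randomized delay spreads retries from many pods over time.
            sleep(backoff_delay(attempt, base=base, cap=cap))
```

The `sleep` parameter is injectable so the retry loop can be exercised in a test without real waiting. In a real service the same effect would more likely come from pool and driver configuration than from a hand-written loop.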

