Code Room
On-call · Hard
Question
Aurora MySQL. A 20-second primary failover happened at 14:00. The DB recovered, but for the next 8 minutes the new primary sat at `Threads_connected` = max_connections, CPU pegged, and the app threw 'too many connections' even though real query volume was normal. App tier is ~200 stateless pods, each with a 20-connection HikariCP pool and aggressive reconnect-on-error. Connections that *do* get in run fast. Walk through what happened and how you'd both recover now and prevent it.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
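One way to ground the hypothesis in numbers is to work out the worst-case connection demand from the app tier and to sketch what a less aggressive reconnect policy could look like. The snippet below is a minimal illustration, not the application's actual code: `pool_demand` and `full_jitter_delay` are hypothetical helper names, and the `base_s` and `cap_s` values are assumptions chosen for the example. The real `max_connections` limit depends on the Aurora instance class and parameter group, which the question does not state.

```python
import random


def pool_demand(pods: int, pool_size: int) -> int:
    """Worst-case number of connections the app tier can try to hold at once."""
    return pods * pool_size


def full_jitter_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0,
                      rng=random) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Exponential backoff with full jitter: the upper bound doubles on each
    attempt up to `cap_s`, and the actual delay is drawn uniformly from
    [0, upper bound] so that pods do not all retry at the same instant.
    base_s and cap_s are illustrative assumptions, not recommended values.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Figures from the question: ~200 pods, 20 connections per pool.
demand = pool_demand(200, 20)
print(demand)  # 4000

# Example schedule for one pod's first five reconnect attempts.
schedule = [round(full_jitter_delay(n), 2) for n in range(5)]
print(schedule)  # values vary from run to run
```

If every pod tries to refill its whole pool at once after the failover, the new primary sees up to 4000 near-simultaneous connection attempts, and each failed attempt is retried immediately. Spreading retries out with jittered backoff is one commonly used mitigation; capping total pool size below the database limit or putting a connection proxy in front of the database are others a strong answer might weigh.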