On-call · Hard · oc-g329

Subject: Cold start | Level: Senior–Staff | ~28 min | Common in: Reliability & on-call · Code quality & review interviews | Industries: Technology

Question

Your database undergoes a planned failover to a standby at 03:00 for maintenance. The DB itself comes back in ~10 seconds, but your stateless API fleet then shows a 60–90 second window of high latency and a burst of errors before fully recovering, even though no app instances restarted. Right after failover, dashboards show a spike in DB connection-establishment time, app-side connection pools draining and refilling, and a thundering herd of simultaneous reconnect attempts from every instance hitting the new primary. How do you triage and reduce the post-failover recovery window?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
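One way to make the "what prevents a repeat" part concrete is to stop every instance from reconnecting at the same moment. The sketch below shows exponential backoff with full jitter around a connection attempt. It is a minimal illustration, not a specific driver's or pool's API: `backoff_delay`, `connect_with_backoff`, the `connect` callable, and the default `base`, `cap`, and `attempts` values are all hypothetical names and numbers chosen for the example.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def connect_with_backoff(connect, sleep=time.sleep, base=0.1, cap=5.0,
                         attempts=6, rng=random):
    """Call connect() until it succeeds or attempts run out.

    `connect` is a hypothetical zero-argument callable that raises
    ConnectionError while the new primary is not yet accepting connections.
    Each instance sleeps a different random delay between tries, so the
    fleet's reconnects spread out instead of arriving as one burst.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
    raise ValueError("attempts must be >= 1")
```

In an interview answer, jitter would likely be only one of several follow-ups, alongside pool sizing and connection-timeout tuning. The point of the sketch is that the delay ceiling grows per attempt while the actual delay is random, which is what breaks the synchronized reconnect burst.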
