On-call · Hard · oc-g612

Subject: Database failover
Level: Senior–Staff
~45 min
Common in: Databases & SQL · Reliability & on-call interviews
Industries: Technology

Question

03:40, full outage. The primary Postgres node behind your payments service stopped responding to health checks, and your Patroni/etcd HA stack tried to fail over to a standby, but the promotion stalled. The app is throwing connection errors and 100% of writes are failing. Patroni logs show the leader lock in etcd is still held, and the standby refuses to promote because it can't confirm the old primary is down ("failed to acquire leader, another node holds the lock"). The old primary's host still responds to ping, but Postgres there is unresponsive (likely hung on a stuck fsync against a degraded disk). One etcd member is also flapping. You're losing ~$5k/min. Walk through your triage and recovery.
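
The flapping etcd member matters because etcd only accepts writes while a majority of its members are healthy, and without etcd writes no node can acquire or renew the leader lock. A minimal sketch of that quorum arithmetic follows. The scenario does not state the etcd cluster size, so the three-member figure in the usage lines is an assumption, and the function names are illustrative rather than part of any etcd or Patroni API.

```python
def quorum_size(members: int) -> int:
    """Majority needed for the cluster to accept writes."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if enough members are healthy to form a majority."""
    return healthy >= quorum_size(members)


def failures_tolerated(members: int) -> int:
    """How many members can be lost while still keeping a majority."""
    return members - quorum_size(members)


# Assumed three-member cluster with one member flapping (size is NOT given
# in the scenario): two healthy members still form a majority, but there
# is no remaining headroom for a second failure.
print(has_quorum(members=3, healthy=2))
print(failures_tolerated(3))
```

If the assumption holds, etcd can still serve the leader lock, which suggests the stalled promotion is more likely about the lock holder than about etcd itself. That is something to confirm from etcd's own health output, not to take from this arithmetic.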

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
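
One way to make "mitigate first" concrete for this scenario is to write down the order of checks before forcing a promotion. The sketch below is illustrative only: `Signals`, its fields, and `next_action` are hypothetical names, the inputs would come from whatever the responder actually observes, and nothing here calls Patroni or etcd. It encodes one common ordering (coordination store first, then fencing, then the lock, then promotion), which is a judgment call and not the only defensible answer.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Observed state during the incident (hypothetical field names)."""
    etcd_has_quorum: bool      # a majority of etcd members is healthy
    old_primary_fenced: bool   # old primary can no longer accept writes
    leader_lock_held: bool     # leader key is still present in etcd


def next_action(s: Signals) -> str:
    """Return the next mitigation step, safest prerequisite first."""
    if not s.etcd_has_quorum:
        # Without quorum nobody can acquire the lock, so fix this first.
        return "restore etcd quorum"
    if not s.old_primary_fenced:
        # Promoting while the old primary might come back risks split brain.
        return "fence the old primary"
    if s.leader_lock_held:
        # The stale lock is what blocks the standby from promoting.
        return "release the stale leader lock"
    return "promote the standby"


# The scenario as described, assuming etcd still has quorum despite the
# flapping member (an assumption, not stated in the question).
scenario = Signals(etcd_has_quorum=True,
                   old_primary_fenced=False,
                   leader_lock_held=True)
print(next_action(scenario))
```

Putting fencing ahead of lock release reflects the payments context: a few more minutes of outage is usually cheaper than two nodes accepting writes at once.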

