Question
03:40, full outage. The primary Postgres node behind your payments service stopped responding to health checks, and your Patroni/etcd HA stack tried to fail over to a standby, but the promotion stalled. The app is throwing connection errors and 100% of writes are failing. Patroni logs show the leader lock in etcd is still held, and the standby refuses to promote because it can't confirm the old primary is down ("failed to acquire leader, another node holds the lock"). The old primary's host still answers ping, but Postgres on it is unresponsive (likely hung on a stuck fsync against a degraded disk). One etcd member is also flapping. You're losing ~$5k/min. Walk through your triage and recovery.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
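For readers less familiar with the mechanism behind the stalled promotion, here is a minimal toy model of a TTL-based leader lock. It is a sketch only, not Patroni's code or etcd's API: the `LeaderLock` class, the `standby_tick` function, the node names, and the 30-second TTL are all hypothetical. It also assumes one possible explanation for the symptom, namely that something on the old primary keeps renewing the lease even though Postgres itself is hung.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaderLock:
    """Toy stand-in for a TTL-based leader key in a DCS such as etcd."""
    holder: Optional[str] = None
    expires_at: float = 0.0

    def is_held(self, now: float) -> bool:
        return self.holder is not None and now < self.expires_at

    def try_acquire(self, node: str, now: float, ttl: float) -> bool:
        # Succeeds if the lock is free, expired, or already ours (a renewal).
        if self.is_held(now) and self.holder != node:
            return False
        self.holder = node
        self.expires_at = now + ttl
        return True


def standby_tick(lock: LeaderLock, node: str, now: float, ttl: float) -> str:
    """One HA-loop iteration on the standby: promote only if we get the lock."""
    if lock.try_acquire(node, now, ttl):
        return "promote"
    return "wait: another node holds the lock"


if __name__ == "__main__":
    TTL = 30.0  # hypothetical lease length in seconds
    lock = LeaderLock()
    lock.try_acquire("pg-old-primary", now=0.0, ttl=TTL)

    # Assumed failure mode: the lease is still renewed at t=20
    # even though Postgres on that host is hung.
    lock.try_acquire("pg-old-primary", now=20.0, ttl=TTL)
    print(standby_tick(lock, "pg-standby", now=25.0, ttl=TTL))

    # Once renewals stop, the lease lapses at t=50 and the standby can take it.
    print(standby_tick(lock, "pg-standby", now=51.0, ttl=TTL))
```

The point of the model is that the standby's refusal is a safety property rather than a bug: while the lease is live, promoting would risk two primaries. Recovery therefore hinges on making the old holder provably stop renewing (or provably dead) before the lock is released or allowed to expire.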