On-call · Hard · oc-g224

Subject: Failover
Level: Senior–Staff
Time: ~45 min
Common in: Reliability & on-call · Distributed systems interviews
Industries: Technology, Software development

Question

Postgres with Patroni + etcd. A network blip caused a failover: replica B was promoted to primary. After the blip cleared, the app is reporting intermittent 'duplicate key' errors, and some users see writes that later vanish on refresh. Dashboards show TWO nodes advertising as primary for ~90 seconds, and your connection pooler (PgBouncer + a VIP) was briefly routing to both. Replication is now reporting a diverged timeline. Triage, contain the damage, and prevent split-brain next time.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
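
One way to make "stop the bleeding first" concrete is to decide quickly which node must stop taking writes. The sketch below is illustrative only: it assumes you have already collected each node's self-reported role and timeline (for example from your dashboards) and that you know which node currently holds the leader key in etcd. `NodeStatus`, `is_split_brain`, `nodes_to_fence`, and the node names A and C are hypothetical; none of this is Patroni's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    name: str      # node identifier (hypothetical names below)
    role: str      # "primary" or "replica", as the node itself advertises
    timeline: int  # Postgres timeline ID the node reports


def is_split_brain(statuses):
    """True when more than one node advertises itself as primary."""
    return sum(1 for s in statuses if s.role == "primary") > 1


def nodes_to_fence(statuses, leader_key_holder):
    """Names of nodes that claim primary without holding the leader key.

    Assumption: the leader key in the consensus store is the source of
    truth, so any other self-declared primary should be taken out of
    the write path before anything else is investigated.
    """
    return sorted(
        s.name
        for s in statuses
        if s.role == "primary" and s.name != leader_key_holder
    )


# Example snapshot resembling the incident: old primary A never demoted,
# replica B was promoted onto a new timeline, C follows B.
statuses = [
    NodeStatus("A", "primary", 1),
    NodeStatus("B", "primary", 2),
    NodeStatus("C", "replica", 2),
]

if is_split_brain(statuses):
    print("fence:", nodes_to_fence(statuses, leader_key_holder="B"))
```

In an interview answer, the point of a check like this is ordering: remove the stale primary from the pooler and VIP first, then reconcile the diverged writes, then look for why the old primary was not demoted.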

