Code Room
On-call · Hard
Question
Postgres with Patroni + etcd. A network blip caused a failover: replica B was promoted to primary. Since the blip cleared, the app has been reporting intermittent 'duplicate key' errors, and some users see writes that later vanish on refresh. Dashboards show TWO nodes advertising as primary for ~90 seconds, and your connection pooler (PgBouncer + a VIP) was briefly routing to both. Replication is now reporting a diverged timeline. Triage, contain the damage, and prevent split-brain next time.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
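As one way to turn "form hypotheses from real signals" into a concrete first check, the sketch below classifies nodes from their own self-reports. It is a minimal illustration, not a Patroni API: `NodeReport`, `triage_split_brain`, and the result keys are hypothetical names, and it assumes you have already collected, per node, the result of `SELECT pg_is_in_recovery()` and the current timeline ID, plus the member name holding the leader key in etcd. It also assumes the leader-key holder is the legitimate primary, which you should confirm before fencing anything.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class NodeReport:
    """What one Postgres node says about itself (hypothetical shape)."""
    name: str
    in_recovery: bool  # result of SELECT pg_is_in_recovery() on that node
    timeline: int      # timeline ID the node is currently on


def triage_split_brain(reports: List[NodeReport],
                       dcs_leader: Optional[str]) -> Dict[str, object]:
    """Summarise who is accepting writes and who should be fenced first.

    Assumption: the member named in the DCS leader key is the legitimate
    primary; any other writable node is a fencing candidate.
    """
    # A node that is not in recovery accepts writes, i.e. acts as a primary.
    writable = [r for r in reports if not r.in_recovery]
    return {
        "split_brain": len(writable) > 1,
        "writable": sorted(r.name for r in writable),
        "fence_first": sorted(r.name for r in writable if r.name != dcs_leader),
        # Writable nodes on different timelines have diverged histories.
        "timelines_diverged": len({r.timeline for r in writable}) > 1,
    }


if __name__ == "__main__":
    # Scenario from the question: old primary A never stepped down, B was promoted.
    reports = [
        NodeReport("A", in_recovery=False, timeline=1),
        NodeReport("B", in_recovery=False, timeline=2),
        NodeReport("C", in_recovery=True, timeline=2),
    ]
    print(triage_split_brain(reports, dcs_leader="B"))
```

The point of the design is ordering: the check only tells you which node to cut off from the pooler and VIP first. Reconciling the writes the stale node accepted, and rejoining it to the cluster, are separate steps that come after writes are contained.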