On-callHardoc-g436

Subject FailoverLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Technology

Question

A network partition briefly isolated your primary Postgres from the HA orchestrator. The orchestrator declared it dead and promoted a standby. Then the partition healed — and now you have a serious mess: some app instances (still able to reach the old node) kept writing to the OLD primary, while others wrote to the newly-promoted one. Both nodes accepted writes for ~90 seconds. Now the two diverge: conflicting rows, broken sequences, duplicate ids. Triage, contain the damage, and prevent recurrence.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.