Code Room
On-callHard
Question
A network partition briefly isolated your primary Postgres from the HA orchestrator. The orchestrator declared it dead and promoted a standby. Then the partition healed — and now you have a serious mess: some app instances (still able to reach the old node) kept writing to the OLD primary, while others wrote to the newly-promoted one. Both nodes accepted writes for ~90 seconds. Now the two diverge: conflicting rows, broken sequences, duplicate ids. Triage, contain the damage, and prevent recurrence.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.