Code Room
On-callHard
Question
Your 3-node primary-replica database cluster spans two availability zones (2 nodes in AZ-a, 1 in AZ-b). A cloud network event partitions AZ-a from AZ-b for ~8 minutes. Dashboards during the partition: write latency in AZ-b spikes and some writes fail; after the partition heals you see a burst of replication conflicts and a few customer reports of 'I updated my profile and it reverted.' Recent context: someone recently 'tuned for availability' and lowered the write-quorum / enabled an auto-failover that promotes aggressively. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.