On-callHardoc-g071

Subject Network partitionLevel Senior–Staff~45 minCommon in Networking & APIs · Distributed systems interviewsIndustries Technology

Question

Your 3-node primary-replica database cluster spans two availability zones (2 nodes in AZ-a, 1 in AZ-b). A cloud network event partitions AZ-a from AZ-b for ~8 minutes. Dashboards during the partition: write latency in AZ-b spikes and some writes fail; after the partition heals you see a burst of replication conflicts and a few customer reports of 'I updated my profile and it reverted.' Recent context: someone recently 'tuned for availability' and lowered the write-quorum / enabled an auto-failover that promotes aggressively. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.