On-call · Hard · oc-g301

Subject: Network partition
Level: Senior–Staff
~45 min
Common in: Networking & APIs · Distributed systems interviews
Industries: Technology

Question

Your primary-replica datastore uses a separate failover controller that promotes a replica when it can't reach the primary. At 03:10 a brief network partition isolates the failover controller from the primary (but the primary itself is healthy and still serving app writes from clients on its side of the partition). The controller promotes the replica. The partition heals at 03:14. Now you have two nodes that both accepted writes for ~4 minutes, with conflicting data. Dashboards: write success was ~100% the whole time on BOTH nodes; no errors; clients on each side were happy. How do you triage and mitigate the damage?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
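Once writes to one side are frozen, a concrete first triage step is to size the damage: which keys were written during the split-brain window, and on which node. The sketch below is one possible way to do that, not a prescribed procedure. It assumes each node can export a log of writes as (key, value, timestamp) records; the `Write` type, the `classify_divergence` function, the sample keys, and the calendar date are all hypothetical. Only the 03:10 to 03:14 window comes from the scenario.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Write:
    """One exported write record (hypothetical shape)."""
    key: str
    value: str
    at: datetime


def classify_divergence(old_primary, promoted, start, end):
    """Bucket keys written inside [start, end] by which node(s) wrote them."""

    def last_write_per_key(writes):
        # Keep only the final in-window write per key, ordered by that
        # node's own clock. Clocks are never compared across nodes.
        latest = {}
        for w in sorted(writes, key=lambda w: w.at):
            if start <= w.at <= end:
                latest[w.key] = w
        return latest

    a = last_write_per_key(old_primary)
    b = last_write_per_key(promoted)

    report = {
        "only_old_primary": [],  # safe to replay onto the surviving node
        "only_promoted": [],     # already on the surviving node if it wins
        "same_value": [],        # both sides converged, nothing to do
        "conflict": [],          # needs a business rule or a human
    }
    for key in sorted(a.keys() | b.keys()):
        if key not in b:
            report["only_old_primary"].append(key)
        elif key not in a:
            report["only_promoted"].append(key)
        elif a[key].value == b[key].value:
            report["same_value"].append(key)
        else:
            report["conflict"].append(key)
    return report


# Illustrative data only; the date is a placeholder.
window_start = datetime(2024, 1, 1, 3, 10)
window_end = datetime(2024, 1, 1, 3, 14)

old_primary_log = [
    Write("order:1", "paid", datetime(2024, 1, 1, 3, 11)),
    Write("order:2", "shipped", datetime(2024, 1, 1, 3, 12)),
    Write("order:9", "new", datetime(2024, 1, 1, 3, 9)),  # before the window
]
promoted_log = [
    Write("order:1", "cancelled", datetime(2024, 1, 1, 3, 12)),
    Write("order:3", "paid", datetime(2024, 1, 1, 3, 13)),
]

report = classify_divergence(old_primary_log, promoted_log, window_start, window_end)
for bucket, keys in report.items():
    print(bucket, keys)
```

The sketch deliberately avoids last-write-wins across nodes, because timestamps from two machines that were partitioned from each other are not a trustworthy ordering. Comparing only the final value per key is also a simplification: for non-idempotent operations such as counters or balance changes, every in-window write on both sides would need review, not just the last one.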


Diagram & narrate the incident
