On-call · Hard · oc-g499

Subject: Network partition
Level: Senior–Staff
~40 min
Common in: Networking & APIs · Distributed systems interviews
Industries: Technology, Software development

Question

Your distributed cache runs primary-replica with an automatic failover daemon (sentinel-style) that promotes a replica if it can't reach the primary. At 03:40 a brief partition isolates the failover daemon (and some clients) from the primary, but the primary itself is healthy and still serving the clients on its side. The daemon promotes the replica. Now you have TWO primaries accepting writes for ~3 minutes; clients on each side see different data, and after the partition heals you find conflicting/overwritten keys. How do you triage and mitigate this SPLIT-BRAIN and prevent recurrence?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
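For the "prevent a repeat" part, two guards are commonly discussed for this kind of split-brain: require a majority of independent observers to agree before promoting, and have the primary refuse writes once it has lost contact with its replicas for longer than a lease. The sketch below illustrates both under simplifying assumptions. The names `should_promote`, `Primary`, `lease_seconds`, and `last_replica_ack` are hypothetical and do not come from any particular cache product; a real system would also need clock-skew margins and a reconciliation plan for keys already in conflict.

```python
from dataclasses import dataclass, field


def should_promote(down_votes: int, total_observers: int) -> bool:
    """Promote only when a strict majority of observers report the primary as down.

    A single isolated observer (as in the 03:40 incident) cannot reach a majority.
    """
    return down_votes > total_observers // 2


@dataclass
class Primary:
    """Hypothetical primary that fences itself when it cannot hear from replicas."""

    lease_seconds: float
    last_replica_ack: float  # timestamp of the most recent replica acknowledgement
    data: dict = field(default_factory=dict)

    def can_accept_writes(self, now: float) -> bool:
        # If no replica has acknowledged within the lease window, assume we may
        # be on the minority side of a partition and stop taking writes.
        return (now - self.last_replica_ack) <= self.lease_seconds

    def write(self, key: str, value: str, now: float) -> bool:
        if not self.can_accept_writes(now):
            return False  # rejected: the caller should retry or fail over
        self.data[key] = value
        return True


if __name__ == "__main__":
    # One of three observers loses sight of the primary: no promotion.
    print(should_promote(down_votes=1, total_observers=3))  # False
    # Two of three agree: promotion is allowed.
    print(should_promote(down_votes=2, total_observers=3))  # True

    primary = Primary(lease_seconds=10.0, last_replica_ack=100.0)
    print(primary.write("k", "v1", now=105.0))  # True: within the lease
    print(primary.write("k", "v2", now=120.0))  # False: lease expired, self-fenced
```

The trade-off is availability: a primary that self-fences rejects writes during the partition instead of diverging, so the failover timeout should be longer than the lease to ensure the old primary has stopped before a new one starts.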

