Question
A stateful cluster (a replicated config/metadata store) runs with replicas split evenly across exactly two data centers — 3 nodes in DC-A, 3 in DC-B — for a clean DR story. The on-call team discovered that if the link between the two DCs drops (a partition), a static majority-of-6 quorum (needs 4) means NEITHER side can form a majority, so the whole cluster goes read-only and the service is down even though both DCs are individually healthy. Worse, an earlier naive attempt to let each side 'go it alone' caused split-brain: both DCs accepted divergent writes. Design a quorum/membership scheme for an even two-site topology that keeps the cluster available during a single-site or link failure WITHOUT ever permitting split-brain. Explain the role of an arbiter/witness, dynamic quorum, and exactly which failures you can survive.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
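The vote arithmetic behind the failure described above can be checked with a short sketch. It is an illustration, not a design. The helper names (`majority`, `has_quorum`) are hypothetical. The "go it alone" model, in which each side shrinks the electorate to the nodes it can still reach, is an assumption about what the naive attempt did. The single witness vote in a third location is likewise an assumption, included only to show how an odd voter count changes the math.

```python
def majority(voters: int) -> int:
    """Smallest vote count that is a strict majority of `voters`."""
    return voters // 2 + 1


def has_quorum(reachable_votes: int, total_voters: int) -> bool:
    """Static quorum rule: a partition may accept writes only with a strict majority."""
    return reachable_votes >= majority(total_voters)


# Topology from the question: 3 voters in DC-A, 3 voters in DC-B.
DC_A, DC_B = 3, 3
TOTAL = DC_A + DC_B

# Inter-DC link drops: each side reaches only its own 3 votes out of 6.
side_a_writable = has_quorum(DC_A, TOTAL)
side_b_writable = has_quorum(DC_B, TOTAL)

# Assumed model of the naive fix: each side treats the nodes it can reach
# as the whole electorate, so both sides independently see a "majority".
naive_a = has_quorum(DC_A, DC_A)
naive_b = has_quorum(DC_B, DC_B)
split_brain = naive_a and naive_b

# Assumption: one extra witness vote hosted outside both DCs (7 voters total).
# During the link failure the witness can side with at most one DC.
WITNESS = 1
TOTAL_WITH_WITNESS = TOTAL + WITNESS
a_with_witness = has_quorum(DC_A + WITNESS, TOTAL_WITH_WITNESS)
b_without_witness = has_quorum(DC_B, TOTAL_WITH_WITNESS)

print("majority of 6:", majority(TOTAL))
print("static quorum, link down: A writable =", side_a_writable,
      "| B writable =", side_b_writable)
print("naive go-it-alone: split-brain =", split_brain)
print("with witness: A+witness writable =", a_with_witness,
      "| B alone writable =", b_without_witness)
```

With a fixed electorate, at most one partition can hold a strict majority, which is why the static rule is safe but unavailable at 3 versus 3, and why shrinking the electorate independently on each side is unsafe. Any scheme that changes the voter set therefore has to make that change itself go through a quorum of the previous voter set.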