Question
Design a multi-region traffic failover system for an active-active service across 4 regions serving 250k requests/sec. When a region degrades (elevated error rate, elevated latency, or full outage), traffic must drain to healthy regions within ~30 seconds without overloading them, and when the region recovers, traffic must return gradually (no thundering herd). The control loop must not flap regions in and out on transient blips and must avoid a split-brain where two controllers fight over the routing table. Walk through the health model, the failover/failback mechanism, and the central trade-off.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
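As a starting point for the scale clarification, here is a rough capacity sketch. It assumes the 250k requests/sec figure is the aggregate across all 4 regions and that traffic is split evenly among healthy regions; neither is stated in the question, and the function names are illustrative only. Under those assumptions each region carries 62.5k requests/sec in steady state, about 83.3k with one region drained (roughly 33% more), and 125k with two drained (100% more).

```python
TOTAL_RPS = 250_000  # assumption: aggregate load across all regions
REGIONS = 4


def per_region_load(total_rps: float, healthy_regions: int) -> float:
    """Load per healthy region, assuming an even split (hypothetical policy)."""
    if healthy_regions < 1:
        raise ValueError("need at least one healthy region")
    return total_rps / healthy_regions


def required_headroom(regions: int, tolerated_failures: int) -> float:
    """Fractional spare capacity each region needs above its steady-state
    share in order to absorb `tolerated_failures` drained regions."""
    survivors = regions - tolerated_failures
    if survivors < 1:
        raise ValueError("cannot tolerate losing every region")
    return regions / survivors - 1.0


if __name__ == "__main__":
    for failed in range(0, 3):
        healthy = REGIONS - failed
        load = per_region_load(TOTAL_RPS, healthy)
        extra = required_headroom(REGIONS, failed)
        print(f"{failed} region(s) down: {load:,.0f} rps per region "
              f"({extra:.0%} above steady state)")
```

The headroom number is what bounds "without overloading them": if regions are provisioned with less spare capacity than the failure count demands, the failover design has to include load shedding or rate limiting rather than relying on redistribution alone.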