Question
A global API is fronted by anycast/GSLB that steers each user to the nearest of 6 regions. When a region degrades (not a clean outage, but elevated errors and latency caused by a bad dependency), automated failover should drain it. Last quarter, however, an automated drain of one region shifted all of its traffic to the neighboring region, which then saturated and tripped its own drain. The cascade continued until a 'capacity meltdown' took down 4 regions, even though only 1 was originally unhealthy. Design region health detection and traffic steering that fails over a genuinely bad region WITHOUT triggering this cascading failover or metastable collapse. Cover how you decide a region is unhealthy, where the failed-over traffic goes, and the safety limits.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
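
To make the failure mode in the question concrete, here is a toy model of naive neighbor failover. It is only a sketch under stated assumptions: the ring topology, the per-region load of 100, the capacity values, and the function name `simulate_naive_failover` are all hypothetical and are not taken from the incident. It is not meant to reproduce the real outage numbers, only the mechanism: a drained region hands its entire load to the next region, which is drained in turn if that pushes it over capacity.

```python
from typing import List


def simulate_naive_failover(
    loads: List[float], capacity: float, first_bad: int
) -> List[int]:
    """Return the regions drained, in order, under naive neighbor failover.

    Assumptions (hypothetical, for illustration only):
    - regions form a ring, and a drained region sends ALL of its traffic
      to the next region that has not been drained yet;
    - every region has the same capacity;
    - a region is drained as soon as its load exceeds that capacity.
    """
    n = len(loads)
    loads = list(loads)  # copy so the caller's list is not mutated
    drained = [first_bad]
    current = first_bad

    while len(drained) < n:
        # Find the next region on the ring that is still serving.
        nxt = (current + 1) % n
        while nxt in drained:
            nxt = (nxt + 1) % n

        # Naive policy: move the full load of the drained region at once.
        loads[nxt] += loads[current]
        loads[current] = 0.0

        if loads[nxt] <= capacity:
            break  # the neighbor absorbed the traffic; the cascade stops

        # The neighbor saturated, so its own drain trips.
        drained.append(nxt)
        current = nxt

    return drained


if __name__ == "__main__":
    # 6 regions at load 100 each; 80% headroom is not enough for a full shift.
    print(simulate_naive_failover([100.0] * 6, capacity=180.0, first_bad=0))
    # With enough headroom to absorb one full region, the cascade stops.
    print(simulate_naive_failover([100.0] * 6, capacity=250.0, first_bad=0))
```

In this model, whenever a single neighbor cannot absorb one full region of extra load, every subsequent neighbor inherits an even larger load, so the drain never stops on its own. That is why the question asks both where the failed-over traffic goes and what safety limits bound each shift.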