Code Room
System design | Hard | sd-g454
Subject: Failover | Level: Senior–Staff | ~45 min | Common in: Reliability & on-call interviews | Industries: Technology

Question

A global API is fronted by anycast/GSLB that steers each user to the nearest of 6 regions. When a region degrades (not a clean outage, but elevated errors or latency caused by a bad dependency), automated failover should drain it. Last quarter, however, an automated drain of one region shifted all of its traffic to the neighboring region, which then saturated and tripped its own drain. The cascade continued until a 'capacity meltdown' took down 4 regions, even though only 1 was originally unhealthy. Design region health detection and traffic steering that fail over a genuinely bad region WITHOUT triggering this cascading failover (metastable collapse). Cover how you decide a region is unhealthy, where the failed-over traffic goes, and which safety limits apply.
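The failure mode in the question can be made concrete with a toy model. The sketch below is illustrative only: the ring topology, the per-region load and capacity numbers, and the function names `naive_cascade` and `capped_drain` are all assumptions, not details of the incident. It contrasts a drain that dumps a region's full load on one neighbor with a drain that moves traffic only into the spare headroom of the healthy regions and reports whatever cannot be placed.

```python
from typing import List, Tuple


def naive_cascade(load: List[float], capacity: float, first_bad: int) -> List[int]:
    """Toy model: draining a region moves ALL of its load to the next region
    in a ring; a region pushed above `capacity` is drained in turn.
    Returns the regions drained, in order."""
    load = list(load)
    n = len(load)
    drained = [first_bad]
    current = first_bad
    while len(drained) < n:
        neighbor = (current + 1) % n  # next region in the (assumed) ring
        load[neighbor] += load[current]
        load[current] = 0.0
        if load[neighbor] <= capacity:
            break  # the neighbor absorbed the traffic; the cascade stops
        drained.append(neighbor)  # the neighbor saturates and drains too
        current = neighbor
    return drained


def capped_drain(
    load: List[float], capacity: float, bad: int
) -> Tuple[List[float], float]:
    """Toy model: move traffic off `bad` only into spare headroom, spread
    across healthy regions in proportion to that headroom.
    Returns (new per-region load, load that could not be moved)."""
    load = list(load)
    healthy = [i for i in range(len(load)) if i != bad]
    headroom = {i: max(0.0, capacity - load[i]) for i in healthy}
    total_headroom = sum(headroom.values())
    movable = min(load[bad], total_headroom)
    if total_headroom > 0:
        for i in healthy:
            load[i] += movable * headroom[i] / total_headroom
    unplaced = load[bad] - movable  # must stay on the bad region or be shed
    load[bad] = unplaced
    return load, unplaced


# Hypothetical numbers: 6 regions at 70 units each, 100 units of capacity.
cascade = naive_cascade([70.0] * 6, 100.0, first_bad=0)
new_load, unplaced = capped_drain([70.0] * 6, 100.0, bad=0)
```

In this model the naive drain never finds a neighbor with enough room, because every region already runs above half of its capacity. The capped drain keeps every healthy region at or below capacity by construction. A real design would also have to decide what happens to the `unplaced` remainder, such as partial drain or load shedding.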

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts (data model, bottlenecks, consistency, failure modes) and name the trade-offs you are making.
