Multi-region traffic failover

When a whole region goes dark, the gateway notices through health checks and reroutes users to a region that’s still healthy.

The idea

You run the same service in more than one place — say us-east and us-west — so that one region failing doesn’t take everyone offline. A traffic gateway (weighted DNS, anycast, or a global load balancer) sits in front and keeps sending users to a healthy region.

The gateway can’t see inside a region, so it probes each one on a fixed interval with a health check. When a region misses enough checks in a row, the gateway marks it down and shifts traffic elsewhere. Detection takes a few missed probes, and clients that cached the old address keep using it until the DNS TTL expires, so failover is measured in seconds to minutes, never zero.

Traffic flows through the gateway to us-east, the primary healthy region.

How it works

Each region exposes a cheap /healthz endpoint. The gateway probes it on a fixed interval and only reacts after a consecutive-failure threshold, so one blip doesn’t trigger a needless reroute. When a region is marked down, its routing weight drops to zero; when it passes again, the weight is restored. New clients pick up the change as their cached DNS answer expires after the TTL.

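import urllib.request

INTERVAL      = 10   # seconds between probes
THRESHOLD     = 3    # consecutive misses before "down"
TTL           = 60   # seconds clients cache the DNS answer
PROBE_TIMEOUT = 2    # seconds before a probe counts as a miss

# `regions` (objects with .healthz and .base_weight) and set_weight() stand in
# for your gateway's own inventory and routing API.
misses = {r: 0 for r in regions}

def probe_ok(region):
    try:
        with urllib.request.urlopen(region.healthz, timeout=PROBE_TIMEOUT) as resp:
            return resp.status == 200
    except OSError:                   # timeout, refused connection, HTTP error
        return False                  # a failed request is a miss, not a crash

def on_probe(region):                 # runs every INTERVAL
    if probe_ok(region):
        misses[region] = 0
        set_weight(region, region.base_weight)   # back in rotation
    else:
        misses[region] += 1
        if misses[region] >= THRESHOLD:          # 3 misses in a row
            set_weight(region, 0)                 # drain: stop new traffic
            # in-flight requests to this region fail and must retry;
            # surviving regions absorb the load until autoscale catches up

# worst-case detection       = THRESHOLD * INTERVAL      (~30s here, plus PROBE_TIMEOUT)
# worst-case client cutover  = detection + TTL           (~90s here)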
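
To see why the consecutive-failure threshold matters, the rule can be replayed against a made-up sequence of probe results. This is a minimal, self-contained sketch: RegionState, record_probe, and replay are hypothetical names for illustration, and the weights are plain integers rather than calls into a real gateway.

```python
from dataclasses import dataclass

THRESHOLD = 3  # consecutive misses before a region is marked down


@dataclass
class RegionState:
    """Probe bookkeeping for one region (hypothetical helper for this sketch)."""
    base_weight: int = 100
    weight: int = 100
    misses: int = 0


def record_probe(state: RegionState, ok: bool) -> None:
    """Apply one probe result using the consecutive-failure rule."""
    if ok:
        state.misses = 0
        state.weight = state.base_weight   # back in rotation
    else:
        state.misses += 1
        if state.misses >= THRESHOLD:
            state.weight = 0               # drained: no new traffic


def replay(results):
    """Return the routing weight after each probe result in the sequence."""
    state = RegionState()
    weights = []
    for ok in results:
        record_probe(state, ok)
        weights.append(state.weight)
    return weights


# Isolated blips are absorbed; only three misses in a row drain the region.
blip = replay([True, False, True, False, False, True])
outage = replay([True, False, False, False, False, True])
print(blip)    # [100, 100, 100, 100, 100, 100]
print(outage)  # [100, 100, 100, 0, 0, 100]
```

In the second sequence the region is drained on the third consecutive miss and restored by the first passing probe, which matches the rule described above. A production gateway might also require several passes before restoring weight; that is a design choice this sketch does not make.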

Active-passive keeps the second region warm but idle until needed (cheaper, slower, possible data loss). Active-active serves both at once (faster cutover, but you must keep data consistent across regions).
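
The two topologies differ mainly in the weights the gateway starts from. The sketch below models them as plain weight tables and picks a region in proportion to its weight. The table names, pick_region, and fail_over are illustrative assumptions; real gateways express the same idea as DNS record weights or load-balancer backend weights.

```python
import random

# Hypothetical weight tables for the two topologies.
ACTIVE_ACTIVE = {"us-east": 50, "us-west": 50}
ACTIVE_PASSIVE = {"us-east": 100, "us-west": 0}


def pick_region(weights, rng):
    """Choose a region with probability proportional to its weight."""
    live = {name: w for name, w in weights.items() if w > 0}
    if not live:
        raise RuntimeError("no healthy region to route to")
    names = list(live)
    return rng.choices(names, weights=[live[n] for n in names], k=1)[0]


def fail_over(weights, down, standby_weight=100):
    """Return new weights with `down` drained, waking a zero-weight standby."""
    survivors = {n: w for n, w in weights.items() if n != down}
    if all(w == 0 for w in survivors.values()):
        # Active-passive case: the standby had no traffic, so give it some.
        survivors = {n: standby_weight for n in survivors}
    return {**survivors, down: 0}


rng = random.Random(0)
after_passive = fail_over(ACTIVE_PASSIVE, "us-east")
after_active = fail_over(ACTIVE_ACTIVE, "us-east")
print(after_passive)  # {'us-west': 100, 'us-east': 0}
print(after_active)   # {'us-west': 50, 'us-east': 0}
print(pick_region(after_passive, rng))  # us-west
```

In the active-active case the survivor keeps its existing weight and simply receives everything, since it is the only region left with a nonzero weight. In the active-passive case the standby has to be brought into rotation first, which is part of why its cutover is slower.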

Cost

Approach	Failover speed (RTO)	Cost & consistency
Active-active	Low RTO — both regions already serving, just shift weights	Higher cost (full second footprint); cross-region writes make consistency hard, replication lag drives RPO
Active-passive	Higher RTO — passive region may be cold and need to scale up	Cheaper (passive runs lean); simpler one-writer model, but unreplicated writes can be lost on cutover (RPO > 0)
DNS TTL routing	Slower — clients cache the answer for the whole TTL	Simple and cheap; lower the TTL for faster cutover, but more DNS queries
Anycast / global LB	Fast — withdraw a route or change weights, no client cache to wait on	More moving parts and provider lock-in; cutover happens at the network edge

RTO (recovery time objective) is how long you can afford to be degraded before traffic recovers. RPO (recovery point objective) is how much recent data you can afford to lose. Both are targets you design the topology to meet.
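
One way to treat RTO and RPO as design targets is to estimate what a given configuration delivers and compare. The sketch below is a rough check, not a full model: it uses the detection-plus-TTL estimate for recovery time and the measured replication lag for the data-loss window. The function names and the example targets (120 s RTO, 1 s RPO, 5 s lag) are assumptions chosen for illustration, and the estimate ignores time spent scaling up the surviving region.

```python
def worst_case_cutover_s(interval_s, threshold, ttl_s):
    """Detection (threshold missed probes) plus the DNS cache lifetime."""
    detection_s = threshold * interval_s
    return detection_s + ttl_s


def meets_targets(cutover_s, replication_lag_s, rto_target_s, rpo_target_s):
    """Compare the estimated recovery time and data-loss window to the targets."""
    return {
        "rto_ok": cutover_s <= rto_target_s,
        "rpo_ok": replication_lag_s <= rpo_target_s,
    }


cutover = worst_case_cutover_s(interval_s=10, threshold=3, ttl_s=60)
report = meets_targets(cutover, replication_lag_s=5,
                       rto_target_s=120, rpo_target_s=1)
print(cutover)  # 90
print(report)   # {'rto_ok': True, 'rpo_ok': False}
```

Here the routing setup fits the assumed RTO, but five seconds of replication lag misses a one-second RPO, so the fix would be on the data side (faster or synchronous replication) rather than in the gateway.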

Watch out for

Flapping from a too-low threshold. One slow probe shouldn’t down a region. Too small a consecutive-failure count makes the gateway thrash traffic back and forth on transient blips.
Long DNS TTL means slow failover. Clients keep hitting the dead address until their cached answer expires. A one-hour TTL can leave users stranded long after detection.
The passive region is cold or under-scaled. If us-west runs at a fraction of capacity, shifting 100% of traffic to it overwhelms it. Keep headroom and actually test the failover.
Split-brain and data divergence on failback. If both regions accept writes during the cutover, reconciling them later is painful. Decide who’s authoritative before you fail back.
Replication lag is your RPO. Writes that hadn’t replicated when the region died are lost or stale in the survivor. Measure the lag; it’s the floor on how much data you can lose.
Retries become a thundering herd. Every in-flight request to the dead region retries at once, piling onto the survivor. Use backoff and load-shedding so the rescue doesn’t cause a second outage.
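
For the retry problem in the last item, a common mitigation is exponential backoff with jitter on the client plus a concurrency limit on the server. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function names, base delay, cap, and concurrency limit are illustrative assumptions, not values from any particular client library.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base_s: float = 0.5, cap_s: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Full-jitter backoff: delay n is uniform in [0, min(cap_s, base_s * 2**n)]."""
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** n))
        delays.append(rng.uniform(0, ceiling))
    return delays


def should_shed(in_flight: int, max_in_flight: int) -> bool:
    """Load shedding: reject new work once the concurrency limit is reached."""
    return in_flight >= max_in_flight


delays = backoff_delays(attempts=8, rng=random.Random(42))
print([round(d, 2) for d in delays])
print(should_shed(in_flight=100, max_in_flight=100))  # True
```

The randomness is the point: clients that all failed at the same moment retry at different times, so the survivor sees a spread of requests instead of a synchronized spike. Shedding on the server side covers the clients that do not back off.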

Worked example

A region-wide outage takes us-east offline. The gateway probes every 10 seconds; the first miss lands within one probe interval, and after 3 consecutive misses (~30s at worst) it marks us-east down and sets its weight to zero. In-flight requests to us-east error out and retry, and those retries land on us-west.

Clients still holding the cached DNS answer keep trying us-east until the 60s TTL expires, so worst-case cutover is roughly 30s detection + 60s TTL ≈ 90s. us-west was provisioned at 60% of peak capacity, so the sudden 100% load briefly saturates it until autoscaling adds nodes — which is why you keep failover headroom and rehearse this. Once us-east recovers and passes its checks again, you restore its weight and drain traffic back gradually rather than snapping it over all at once.
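
The capacity and failback parts of this example come down to simple arithmetic, sketched below. The helper names are hypothetical, and the four-step linear ramp is an assumption; the right number of steps and the pause between them depend on how quickly the recovering region warms up.

```python
def overload_factor(provisioned_fraction, traffic_fraction=1.0):
    """How many times over capacity the survivor runs (1.0 means exactly full)."""
    return traffic_fraction / provisioned_fraction


def failback_steps(base_weight, steps):
    """Evenly spaced weights that ramp a recovered region back to base_weight."""
    return [round(base_weight * i / steps) for i in range(1, steps + 1)]


# us-west provisioned at 60% of peak, suddenly receiving 100% of peak traffic.
print(round(overload_factor(0.60), 2))  # 1.67
# Restore us-east in four increments instead of one jump.
print(failback_steps(100, 4))           # [25, 50, 75, 100]
```

An overload factor of about 1.67 means us-west is asked to carry two-thirds more than it was sized for until autoscaling catches up. On failback, each weight step would be applied only after the previous one has held steady for a while.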

Check yourself

Health checks run every 10s, the threshold is 3 consecutive misses, and the DNS TTL is 60s. Roughly what’s the worst-case time before a client lands on the healthy region?

Your passive region normally runs at 50% of peak capacity to save money. What’s the main risk the moment failover sends it 100% of traffic?