Health Flapping

Health flapping occurs when a server repeatedly fails, recovers, and immediately receives full traffic again before it is ready for it.

The idea

A Load Balancer routes traffic only to servers it considers "Healthy". To check health, it periodically requests an endpoint such as /healthz. If a server is struggling under heavy load, the health check may time out, so the Load Balancer marks the server "Unhealthy" and stops sending it traffic. With no traffic to serve, the server recovers and soon reports "Healthy" again. The Load Balancer immediately sends it a full share of traffic, which overloads it again. The server alternates rapidly between Healthy and Unhealthy, a destructive cycle known as Health Flapping.

How it works (Hysteresis & Deep Health Checks)

To prevent flapping, introduce friction into state changes (hysteresis): make it fast to mark a server Unhealthy (e.g., 2 consecutive failed checks) but slow to mark it Healthy again (e.g., 5 consecutive passed checks, which takes about 50 seconds at a 10-second check interval). Furthermore, /healthz shouldn't just return 200 OK unconditionally. A Deep Health Check exercises a critical dependency, for example by running a simple database query, to verify the server is genuinely ready to handle load.
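The asymmetric thresholds can be sketched as a small state tracker. This is a minimal illustration, not how any particular load balancer implements it: the class name HealthTracker, the helper count_transitions, and the probe sequence are all hypothetical, and the fall/rise names are borrowed only as convenient labels for the two thresholds.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one backend's state with separate fall/rise thresholds (hypothetical sketch)."""
    fall: int = 2          # consecutive failures needed to mark the server Unhealthy
    rise: int = 5          # consecutive passes needed to mark it Healthy again
    healthy: bool = True
    _streak: int = 0       # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed in one health check result and return the resulting state."""
        if check_passed == self.healthy:
            # The result agrees with the current state, so any opposing streak resets.
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.fall if self.healthy else self.rise
        if self._streak >= threshold:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


def count_transitions(tracker: HealthTracker, results: list) -> int:
    """Count how many times the state flips over a sequence of check results."""
    flips = 0
    previous = tracker.healthy
    for passed in results:
        current = tracker.record(passed)
        if current != previous:
            flips += 1
        previous = current
    return flips


# Made-up probe results for a struggling server: brief recoveries between failures.
probe_results = [False, False, True, True, False, False, True, True, False, False]

eager = HealthTracker(fall=2, rise=1)     # recovers on the first passing check
patient = HealthTracker(fall=2, rise=5)   # must pass 5 checks in a row to recover

print(count_transitions(eager, probe_results))    # 5 state changes: flapping
print(count_transitions(patient, probe_results))  # 1 state change: stays out of rotation
```

With rise=1 the tracker flips state five times over ten probes; with rise=5 it leaves rotation once and stays out, because the brief recoveries never reach five consecutive passes.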

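# Illustrative health check settings (generic pseudo-config, not literal HAProxy or AWS ALB syntax).
# HAProxy expresses these as inter / fall / rise; ALB target groups as
# HealthCheckIntervalSeconds / UnhealthyThresholdCount / HealthyThresholdCount.

HealthCheck:
  Path: /healthz
  Interval: 10 seconds
  Timeout: 2 seconds

  # Fast fail: mark Unhealthy after 2 consecutive failures (about 20 seconds)
  UnhealthyThreshold: 2

  # Slow recovery: require 5 consecutive passes (about 50 seconds)
  # This breaks the "Flapping" loop.
  HealthyThreshold: 5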

Cost

Deep Health Checks that query the database add load to that database. If you have 100 servers and the Load Balancer checks each one every 5 seconds, that is 100 / 5 = 20 database queries per second just for health checks. You must balance the depth of the check against the load it places on downstream dependencies.
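One possible way to strike that balance, offered here as an assumption rather than something this text prescribes, is to cache the deep check's result for a short time so that not every probe reaches the database. The sketch below uses an in-memory SQLite connection as a stand-in for the real database; the class name CachedDeepCheck, the ttl parameter, and the injectable clock are hypothetical.

```python
import sqlite3
import time


class CachedDeepCheck:
    """Runs a trivial query at most once per `ttl` seconds and reuses the result."""

    def __init__(self, conn, ttl=15.0, clock=time.monotonic):
        self.conn = conn
        self.ttl = ttl
        self.clock = clock            # injectable so the example is deterministic
        self.queries_run = 0
        self._last_result = None
        self._last_checked = None

    def _query_database(self) -> bool:
        """The actual deep check: can the database answer a trivial query?"""
        self.queries_run += 1
        try:
            self.conn.execute("SELECT 1").fetchone()
            return True
        except sqlite3.Error:
            return False

    def status_code(self) -> int:
        """Return the HTTP status a /healthz handler would send."""
        now = self.clock()
        if self._last_checked is None or now - self._last_checked >= self.ttl:
            self._last_result = self._query_database()
            self._last_checked = now
        return 200 if self._last_result else 503


# Simulate probes arriving every 5 seconds against a 15-second cache.
fake_now = [0.0]
check = CachedDeepCheck(sqlite3.connect(":memory:"), ttl=15.0, clock=lambda: fake_now[0])
codes = []
for probe_time in (0.0, 5.0, 10.0, 15.0, 20.0, 25.0):
    fake_now[0] = probe_time
    codes.append(check.status_code())

print(codes)              # [200, 200, 200, 200, 200, 200]
print(check.queries_run)  # 2 queries for 6 probes
```

The trade-off is freshness: with a 15-second cache, a database outage can go unnoticed by the health check for up to 15 seconds, which delays the fast-fail path.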

Watch out for

Cascading Flaps: If you have 3 servers and Server A starts flapping, its traffic shifts to Servers B and C each time it drops out. If B and C lack the spare capacity to absorb it, they become overloaded, start failing health checks, and flap too. Soon all servers are flapping in unison and the entire application goes down. Conservative HealthyThresholds, combined with autoscaling to add capacity, help the fleet survive this.
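Whether a flap cascades comes down to headroom: can the remaining servers absorb the traffic of the one that dropped out? The sketch below is a deliberately simplified model with made-up numbers. It assumes traffic is split evenly, that a server fails its checks as soon as its share exceeds its capacity, and that servers fail one at a time. The function name simulate_cascade is hypothetical.

```python
def simulate_cascade(total_rps: float, capacity_rps: float, servers: int) -> int:
    """Return how many servers remain healthy after one server drops out.

    Simplified model: traffic is split evenly, and any server whose share
    exceeds capacity_rps is assumed to fail its health checks next.
    """
    healthy = servers - 1  # Server A has just been marked Unhealthy
    while healthy > 0 and total_rps / healthy > capacity_rps:
        healthy -= 1       # an overloaded survivor fails its checks too
    return healthy


# 3 servers sharing 2400 requests/second (800 each when all are healthy).
print(simulate_cascade(total_rps=2400, capacity_rps=1000, servers=3))  # 0: full cascade
print(simulate_cascade(total_rps=2400, capacity_rps=1300, servers=3))  # 2: B and C absorb it
```

In the first case each survivor would need to handle 1200 requests/second against a capacity of 1000, so B fails, then C. In the second, the same 1200 fits under a capacity of 1300 and the outage stays contained to Server A.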