Thundering Herds (Cache Miss Storm)

When a single expired cache item takes down your whole backend.

The idea

If you cache the homepage of a busy news site for 60 seconds, everything runs fast. But at second 60, the cached copy expires. If 5,000 users request the homepage at that moment, every one of them sees a cache miss, and the cache forwards all 5,000 requests to the database to build the same fresh homepage. The database is overwhelmed instantly. This is the Thundering Herd problem (also called a cache stampede).

Step 1: Cache is valid. 1,000 users hit the cache, DB does zero work.
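To put rough numbers on the herd, here is a minimal arithmetic sketch of a naive cache. It assumes a deliberately simple model that is not part of the original description: all times are integer milliseconds, the first miss repopulates the cache once its query finishes, and every request that arrives before then is also forwarded to the database. The function name and parameters are hypothetical.

```python
def count_db_queries(arrival_times_ms, expires_at_ms, query_ms):
    """Count the DB queries a naive cache forwards once its entry has expired."""
    refilled_at = None  # time at which the first DB query finishes and refills the cache
    db_queries = 0
    for t in sorted(arrival_times_ms):
        if t < expires_at_ms:
            continue  # entry still fresh: cache hit
        if refilled_at is not None and t >= refilled_at:
            continue  # cache already repopulated: cache hit
        db_queries += 1  # miss: this request goes to the database
        if refilled_at is None:
            refilled_at = t + query_ms
    return db_queries


# 5,000 users arriving at the exact moment of expiry, with a 200 ms query
herd = [60_000] * 5000
print(count_db_queries(herd, expires_at_ms=60_000, query_ms=200))  # 5000
```

The model ignores the fact that a real database slows down under this load, which keeps the miss window open longer and makes the pile-up worse.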

How it works (Stale-While-Revalidate)

To fix this, we allow the cache to serve slightly stale data while it updates in the background. The very first request after expiration triggers a background fetch, but still receives the stale copy. All other requests that arrive during the refresh also receive the stale copy. No one is blocked, and the database receives exactly one request per refresh.

# The HTTP Header solution: Stale-While-Revalidate

# Cache-Control: max-age=60, stale-while-revalidate=120

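def get_homepage(request_time):
    item = cache.get("homepage")

    # 0. Nothing cached at all (cold start or eviction). Must block and wait.
    if item is None:
        return fetch_from_database_synchronously()

    # 1. Fresh? Serve it. (0 - 60s)
    if request_time < item.expires_at:
        return item.data

    # 2. Stale, but within the revalidate window? (60s - 180s)
    if request_time < item.expires_at + 120:
        # This check-and-set must be atomic (e.g. done under a lock),
        # otherwise two requests can both start a refresh.
        if not item.is_updating_in_background:
            item.is_updating_in_background = True
            fire_async_background_task(update_cache)  # DB gets 1 request!

        # Immediately return the STALE data to the user! No waiting!
        return item.data

    # 3. Completely expired (> 180s). Must block and wait.
    # This path has no herd protection of its own.
    return fetch_from_database_synchronously()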
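The three branches above can be exercised with a small in-process sketch. This is one possible implementation under stated assumptions, not a definitive one: `SWRCache`, `Entry`, and `loader` are hypothetical names, a thread stands in for the background task, and a `threading.Lock` makes the "is a refresh already running?" check atomic.

```python
import threading
import time
from dataclasses import dataclass

MAX_AGE = 60       # seconds the entry counts as fresh
SWR_WINDOW = 120   # extra seconds it may still be served stale


@dataclass
class Entry:
    data: str
    expires_at: float
    refreshing: bool = False


class SWRCache:
    """Single-entry cache; `loader` is a hypothetical stand-in for the DB query."""

    def __init__(self, loader, clock=time.monotonic):
        self._loader = loader
        self._clock = clock
        self._entry = None
        self._lock = threading.Lock()
        self._worker = None

    def _store(self, data):
        with self._lock:
            self._entry = Entry(data, self._clock() + MAX_AGE)

    def _refresh(self):
        try:
            self._store(self._loader())
        except Exception:
            # Refresh failed: clear the flag so a later request can retry.
            # (A real system would also log the error.)
            with self._lock:
                self._entry.refreshing = False

    def get(self):
        now = self._clock()
        with self._lock:
            entry = self._entry
            if entry is not None and now < entry.expires_at + SWR_WINDOW:
                if now >= entry.expires_at and not entry.refreshing:
                    entry.refreshing = True  # check-and-set under the lock
                    self._worker = threading.Thread(target=self._refresh, daemon=True)
                    self._worker.start()
                return entry.data  # fresh or stale, never blocks on the DB
        # Missing or past the stale window: fetch synchronously.
        data = self._loader()
        self._store(data)
        return data
```

Note that the synchronous path at the end is still unprotected: if the entry is missing or fully expired, every concurrent caller queries the database, so this sketch only covers the stale window.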

Cost

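Users might see data that is out of date: with the header above, up to 120 seconds past its 60-second freshness lifetime in the worst case, though usually only for as long as one background refresh takes. You are trading strict consistency (everyone sees the absolute latest data instantly) for high availability and low latency.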

Watch out for

Mutex Locks: If you can't use `stale-while-revalidate`, use a mutex lock instead. On a cache miss, Request 1 acquires the lock and queries the DB. Requests 2-5000 fail to acquire the lock and simply `sleep(50ms)` in a loop, re-checking the cache until Request 1 has filled it. Give the lock a timeout so that a crashed Request 1 cannot leave everyone waiting forever.
Empty Caches on Deploy: When deploying a new Redis cluster, the cache is completely empty. The first wave of traffic will cause a massive Thundering Herd. You must "warm" the cache programmatically before routing real user traffic to it.
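Both precautions can be sketched in a few lines. The example below is a minimal in-process illustration with assumptions that are not from the original: `LockedCache` and `loader` are hypothetical names, and a `threading.Lock` per key stands in for whatever distributed lock a shared cache such as Redis would need. It has no lock timeout, which a production version would want.

```python
import threading
import time


class LockedCache:
    """In-process stand-in for a shared cache guarded by a per-key mutex."""

    def __init__(self, loader, poll_interval=0.05):
        self._loader = loader            # hypothetical: the expensive DB query
        self._poll_interval = poll_interval
        self._data = {}
        self._locks = {}
        self._guard = threading.Lock()   # protects _data and _locks

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        lock = self._lock_for(key)
        while True:
            with self._guard:
                if key in self._data:
                    return self._data[key]
            # Cache miss: only one caller wins the lock and queries the DB.
            if lock.acquire(blocking=False):
                try:
                    with self._guard:
                        if key in self._data:  # filled while we were racing
                            return self._data[key]
                    value = self._loader(key)
                    with self._guard:
                        self._data[key] = value
                    return value
                finally:
                    lock.release()
            # Everyone else sleeps briefly, then re-checks the cache.
            time.sleep(self._poll_interval)

    def warm(self, keys):
        """Fill the cache before real user traffic is routed to it."""
        for key in keys:
            self.get(key)
```

Calling `warm(["homepage"])` during deploy, before the load balancer sends traffic, means the first wave of users finds a populated cache; if warming is skipped, the lock still limits the database to one query per key.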