Query of death

One poisoned request, retried forever, knocks over every worker that touches it.

The idea

A query of death is a single request that reliably crashes whatever process handles it — a malformed input, a pathological regex, an allocation that runs the worker out of memory. On its own a crash is survivable. The danger is the retry loop.

When the worker dies, the load balancer assumes bad luck and re-sends the same request to the next healthy worker. That one dies too. The poison pill marches down the fleet, taking out workers faster than they can restart, until nothing is left to serve good traffic.

How it works

The fix is to stop trusting the request after it has proven it kills workers. Track failures per request, not just per worker. Once the same payload has caused N crashes (MAX_RETRIES in the sketch below), quarantine it in a poison-pill queue or "dead-letter" sink instead of retrying.

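from collections import Counter

MAX_RETRIES = 2                         # crashes tolerated per fingerprint (the N above)
failures = Counter()                    # fingerprint -> crash count; missing keys read as 0
                                        # in practice this must live in a store shared by all workers

# fingerprint(), process() and dead_letter are supplied by the service.
def handle(req):
    key = fingerprint(req)              # stable hash of the payload
    if failures[key] >= MAX_RETRIES:    # it already killed workers
        dead_letter.put(req)            # quarantine, never retry
        return 503

    try:
        return process(req)
    except Exception:
        failures[key] += 1              # blame the request, not the worker
        raise                           # let this worker die / restart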

The key shift: a normal retry policy counts failures against the worker. A query-of-death defense counts them against the request fingerprint, so a poison pill is caught after at most MAX_RETRIES hops instead of circling the whole fleet.
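One gap in the handler above: the `except Exception` branch only runs if the worker survives long enough to raise. An out-of-memory kill or a health-check kill ends the process first, and any in-process counter dies with it. A minimal sketch of one way around this, assuming each worker handles one request at a time and that some supervisor (or the load balancer) notices worker deaths: the worker notes what it is handling, and the observer assigns the blame. `CrashLedger`, its method names, and the canonical-JSON `fingerprint` scheme are hypothetical, and the in-memory dicts stand in for a store shared across the fleet.

```python
import hashlib
import json
from collections import Counter


def fingerprint(payload: dict) -> str:
    """Stable hash of the payload: equal content gives an equal key."""
    # sort_keys makes the hash independent of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CrashLedger:
    """Blames a crash on the request the dead worker was handling.

    Assumption: in production both maps live in a store that outlives
    any single worker; plain in-memory containers stand in for it here.
    """

    def __init__(self):
        self.in_flight = {}       # worker_id -> fingerprint being processed
        self.crashes = Counter()  # fingerprint -> crashes blamed on it

    def start(self, worker_id: str, key: str) -> None:
        # Recorded before the risky work, so a hard kill still leaves a trace.
        self.in_flight[worker_id] = key

    def finish(self, worker_id: str) -> None:
        # The request completed, so it is not a query of death.
        self.in_flight.pop(worker_id, None)

    def worker_died(self, worker_id: str):
        # Called by whoever observes the death, never by the dead worker.
        key = self.in_flight.pop(worker_id, None)
        if key is not None:
            self.crashes[key] += 1
        return key
```

Because the count is updated by the observer rather than the victim, it does not matter how the worker died. An idle worker that dies blames no request.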

Cost

| Aspect | Without defense | With per-request quarantine |
| --- | --- | --- |
| Workers lost | Whole fleet | At most N |
| Blast radius | O(workers) | O(MAX_RETRIES) |
| Extra state | None | O(distinct failing requests) |

The trade-off is a small failure-count map and the risk of quarantining a request that failed for a transient reason — so quarantine should expire, not be permanent.
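An expiring quarantine could look roughly like the sketch below. `ExpiringQuarantine`, the 600-second TTL, and the injectable `clock` parameter are illustrative assumptions, not part of the original design. The clock is injectable only so the expiry can be checked without sleeping.

```python
import time


class ExpiringQuarantine:
    """Quarantine whose entries lapse after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expires_at = {}  # fingerprint -> time the quarantine ends

    def add(self, key: str) -> None:
        self._expires_at[key] = self.clock() + self.ttl_seconds

    def contains(self, key: str) -> bool:
        expires_at = self._expires_at.get(key)
        if expires_at is None:
            return False
        if self.clock() >= expires_at:
            # Expired: drop the entry and give the request another chance.
            del self._expires_at[key]
            return False
        return True


# Usage with a fake clock (values chosen for illustration only).
now = [0.0]
quarantine = ExpiringQuarantine(ttl_seconds=600, clock=lambda: now[0])
quarantine.add("abc123")
print(quarantine.contains("abc123"))  # True
now[0] = 601.0
print(quarantine.contains("abc123"))  # False
```

The flip side is that a genuinely deterministic poison pill gets another N attempts each time its entry expires, so the TTL trades worker churn against how long a falsely quarantined request stays blocked.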

Worked example

A search service accepts a regex filter. A user submits (a+)+$ against a long run of a's that ends in a non-matching character, which triggers catastrophic backtracking. The worker pins a CPU and the health check kills it. The load balancer reroutes the identical query to worker 2, which dies the same way, then worker 3, and so on. In four minutes the fleet is gone and real searches time out.

The fix that shipped: fingerprint the query, count failures by fingerprint, and after two crashes route that exact query to a dead-letter queue and return 503. The next deploy also bounded regex steps so the dangerous work could fail fast instead of crashing. With quarantine after 2 crashes, the blast radius shrinks from the whole fleet to two workers.
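The change in blast radius can be reproduced with a toy simulation. This is a sketch under simplifying assumptions: the fleet size of 20 is made up, workers do not restart during the incident, and the load balancer retries on the next worker in order. `run_incident` is a hypothetical name.

```python
from collections import Counter

POISON = "(a+)+$"  # the query from the worked example


def run_incident(fleet_size, quarantine_after=None):
    """Send one poison request at the fleet and report the damage.

    quarantine_after=None models a balancer that retries blindly;
    an integer N models quarantine after N crashes per fingerprint.
    """
    alive = [True] * fleet_size
    crashes = Counter()  # fingerprint -> crashes blamed on it
    dead_letter = []

    for worker in range(fleet_size):
        if quarantine_after is not None and crashes[POISON] >= quarantine_after:
            dead_letter.append(POISON)  # quarantined: no further worker sees it
            break
        alive[worker] = False           # the worker takes the request and dies
        crashes[POISON] += 1            # the balancer retries on the next worker

    return {"workers_lost": alive.count(False), "quarantined": bool(dead_letter)}


print(run_incident(fleet_size=20))
# {'workers_lost': 20, 'quarantined': False}
print(run_incident(fleet_size=20, quarantine_after=2))
# {'workers_lost': 2, 'quarantined': True}
```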
