Query of death

One poisoned request, retried forever, knocks over every worker that touches it.

The idea

A query of death is a single request that reliably crashes whatever process handles it — a malformed input, a pathological regex, an allocation that runs the worker out of memory. On its own a crash is survivable. The danger is the retry loop.

When the worker dies, the load balancer assumes bad luck and re-sends the same request to the next healthy worker. That one dies too. The poison pill marches down the fleet, taking out workers faster than they can restart, until nothing is left to serve good traffic.

Press play to send traffic at the fleet.

quarantine after 2 crashes

How it works

The fix is to stop trusting the request after it has proven it kills workers. Track failures per request, not just per worker. Once the same payload has caused N crashes, quarantine it (a poison-pill queue or a "dead-letter" sink) instead of retrying.

def handle(req):
    key = fingerprint(req)              # stable hash of the payload
    if failures[key] >= MAX_RETRIES:    # it already killed workers
        dead_letter.put(req)            # quarantine, never retry
        return 503

    try:
        return process(req)
    except Exception:
        failures[key] += 1              # blame the request, not the worker
        raise                           # let this worker die / restart

The key shift: a normal retry policy counts failures against the worker. A query-of-death defense counts them against the request fingerprint, so a poison pill is caught after a couple of hops instead of circling the whole fleet.

Cost

Aspect	Without defense	With per-request quarantine
Workers lost	Whole fleet	At most N
Blast radius	O(workers)	O(MAX_RETRIES)
Extra state	None	O(distinct failing requests)

The trade-off is a small failure-count map and the risk of quarantining a request that failed for a transient reason — so quarantine should expire, not be permanent.

Watch out for

Counting per worker, not per request. If your retry budget lives on the worker, every fresh worker happily re-runs the same poison pill. The counter has to follow the payload.
Unbounded retries on the load balancer. A retry-on-5xx policy with no cap turns one bad request into a fleet-wide outage. Cap retries and prefer hedging only for idempotent, fast paths.
No fingerprint, or a fingerprint that changes. If the same logical request hashes differently each time (timestamps, request IDs baked in), you can never recognise the repeat offender.
Permanent quarantine. A transient dependency blip can look like a query of death. Expire the quarantine and re-admit slowly, or you will drop legitimate traffic forever.
Crashing instead of failing. An OOM or segfault takes the whole process down, losing in-flight good requests too. Validate inputs and bound work before the dangerous step.

Worked example

A search service accepts a regex filter. A user submits (a+)+$ against a long string — catastrophic backtracking. The worker pins a CPU and the health check kills it. The load balancer reroutes the identical query to worker 2, which dies the same way, then worker 3… In four minutes the fleet is gone and real searches time out.

The fix that shipped: fingerprint the query, count failures by fingerprint, and after two crashes route that exact query to a dead-letter queue returning 503. The next deploy also bounded regex steps so the dangerous work could fail fast instead of crashing. Toggle quarantine after 2 crashes above to watch the blast radius shrink from the whole fleet to two workers.

Check yourself

A bad request is crashing one worker after another across the fleet. Which change stops the cascade?