One poisoned request, retried forever, knocks over every worker that touches it.
A query of death is a single request that reliably crashes whatever process handles it — a malformed input, a pathological regex, an allocation that runs the worker out of memory. On its own a crash is survivable. The danger is the retry loop.
When the worker dies, the load balancer assumes bad luck and re-sends the same request to the next healthy worker. That one dies too. The poison pill marches down the fleet, taking out workers faster than they can restart, until nothing is left to serve good traffic.
The fix is to stop trusting the request after it has proven it kills workers. Track failures per request, not just per worker. Once the same payload has caused N crashes, quarantine it (a poison-pill queue or a "dead-letter" sink) instead of retrying.

    def handle(req):
        key = fingerprint(req)              # stable hash of the payload
        if failures[key] >= MAX_RETRIES:    # it already killed workers
            dead_letter.put(req)            # quarantine, never retry
            return 503
        try:
            return process(req)
        except Exception:
            failures[key] += 1              # blame the request, not the worker
            raise                           # let this worker die / restart

The key shift: a normal retry policy counts failures against the worker. A query-of-death defense counts them against the request fingerprint, so a poison pill is caught after a couple of hops instead of circling the whole fleet.
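One gap in counting failures inside an exception handler is that a hard crash (an out-of-memory kill, a failed health check) never reaches the handler, and an in-process counter dies with the worker. A minimal sketch of one way around that, assuming the count lives in a store shared across workers: record the attempt before doing the risky work and clear it only on success. Here `attempts` is a plain `Counter` standing in for that shared store, and `guarded`, `MAX_RETRIES = 2`, and the JSON canonicalisation in `fingerprint` are illustrative choices, not part of the original design.

```python
import hashlib
import json
from collections import Counter

MAX_RETRIES = 2  # hypothetical threshold

# Stand-in for a store shared by all workers. A real deployment would need
# something that survives a worker crash; an in-process Counter does not.
attempts = Counter()


def fingerprint(payload: dict) -> str:
    """Stable hash: the same logical payload always yields the same key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def guarded(payload: dict, process):
    """Run process(payload) unless this payload has already failed too often."""
    key = fingerprint(payload)
    if attempts[key] >= MAX_RETRIES:
        return 503                # quarantined: do not touch it again
    attempts[key] += 1            # mark the attempt BEFORE the risky work
    result = process(payload)     # a crash or exception leaves the mark in place
    attempts[key] -= 1            # survived: clear the in-flight mark
    return result
```

Because the mark is written first, a worker that dies mid-request still leaves evidence behind. The cost is that identical requests in flight at the same time temporarily inflate the count, so a real implementation would need to account for concurrency.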
| Aspect | Without defense | With per-request quarantine |
|---|---|---|
| Workers lost | Whole fleet | At most N (= MAX_RETRIES) |
| Blast radius | O(workers) | O(MAX_RETRIES) |
| Extra state | None | O(distinct failing requests) |
The trade-off is a small failure-count map and the risk of quarantining a request that failed for a transient reason — so quarantine should expire, not be permanent.
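Expiry can be sketched as a failure map whose entries lapse after a quiet period. This is an illustration under assumptions: the class name `ExpiringQuarantine`, the 300-second default, and the injectable `clock` are all hypothetical, and the map is in-process for brevity.

```python
import time


class ExpiringQuarantine:
    """Failure counts per fingerprint that lapse after ttl_seconds of quiet."""

    def __init__(self, max_failures=2, ttl_seconds=300.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (failure_count, time_of_last_failure)

    def _live(self, key):
        """Return the current (count, last_failure), dropping expired entries."""
        count, last = self._entries.get(key, (0, 0.0))
        if count and self.clock() - last >= self.ttl:
            del self._entries[key]  # expired: the request gets a fresh start
            return 0, 0.0
        return count, last

    def record_failure(self, key):
        count, _ = self._live(key)
        self._entries[key] = (count + 1, self.clock())

    def is_quarantined(self, key):
        count, _ = self._live(key)
        return count >= self.max_failures
```

Dropping expired entries on read also keeps the map from growing without bound for keys that are looked up again, though keys that are never seen again would still need a periodic sweep.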
A search service accepts a regex filter. A user submits (a+)+$ against a long run of a's that ends in a non-matching character, which triggers catastrophic backtracking. The worker pins a CPU and the health check kills it. The load balancer reroutes the identical query to worker 2, which dies the same way, then worker 3… In four minutes the fleet is gone and real searches time out.
The fix that shipped: fingerprint the query, count failures by fingerprint, and after two crashes route that exact query to a dead-letter queue returning 503. The next deploy also bounded regex steps so the dangerous work could fail fast instead of crashing.
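The difference in blast radius can be reproduced with a toy model. This sketch assumes the worst case described above, where crashed workers do not restart in time to matter; `simulate` and its parameters are hypothetical names for illustration only.

```python
def simulate(fleet_size, quarantine_after=None):
    """Return how many workers a single poison request kills.

    quarantine_after=None models a load balancer that retries blindly;
    an integer models quarantining the request after that many crashes.
    """
    alive = list(range(fleet_size))
    crashes = 0
    while alive:
        if quarantine_after is not None and crashes >= quarantine_after:
            break        # request is dead-lettered; the caller gets a 503
        alive.pop(0)     # next healthy worker takes the request and dies
        crashes += 1
    return crashes


print(simulate(20))                      # 20: the whole fleet
print(simulate(20, quarantine_after=2))  # 2: capped by the threshold
```

With no defense the loss scales with fleet size. With quarantine it is capped by the threshold, however large the fleet is.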
A bad request is crashing one worker after another across the fleet. Which change stops the cascade?