Cache & queue incidents

Why one bad payload can block millions of healthy messages.

The idea

Message queues such as Kafka or RabbitMQ deliver messages in order (in Kafka, within each partition). If a worker receives a message it cannot process (a Poison Message) and crashes before acknowledging it, the queue redelivers that same message to the next worker. That worker crashes too, and the cycle repeats. Suddenly, you have a Redelivery Storm.

Because the queue will not skip the bad message (skipping would break ordering), Consumer Lag skyrockets as millions of healthy messages pile up behind it. This is head-of-line blocking. To fix this, you configure a Dead Letter Queue (DLQ). After N failed attempts, the system routes the poison message to the DLQ, where it can be inspected later, and the pipeline is unblocked.
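To make the effect on lag concrete, here is a minimal sketch that simulates an ordered queue with one poison message. It is a toy model, not a broker client: the function name `simulate`, the `"POISON"` marker, and the `max_attempts` parameter are all assumptions made for illustration.

```python
from collections import deque


def simulate(messages, max_attempts=None, ticks=20):
    """Consume at most one message per tick from an ordered queue.

    max_attempts=None models a consumer with no DLQ: a failing message is
    retried forever. Otherwise the message is moved to the DLQ once it has
    failed max_attempts times.
    """
    queue = deque(messages)
    dlq = []
    attempts = {}      # message -> number of failed attempts so far
    processed = []
    for _ in range(ticks):
        if not queue:
            break
        msg = queue[0]  # always the head: ordering is preserved
        if msg != "POISON":
            processed.append(queue.popleft())
            continue
        # The handler "fails" on the poison message.
        attempts[msg] = attempts.get(msg, 0) + 1
        if max_attempts is not None and attempts[msg] >= max_attempts:
            dlq.append(queue.popleft())  # park it and unblock the queue
    return processed, dlq, len(queue)    # len(queue) is the remaining lag


messages = ["a", "b", "POISON", "c", "d", "e"]
print(simulate(messages))                  # (['a', 'b'], [], 4)
print(simulate(messages, max_attempts=3))  # (['a', 'b', 'c', 'd', 'e'], ['POISON'], 0)
```

Without a DLQ, the consumer spends every remaining tick on the poison message and the three healthy messages behind it never move. With a limit of three attempts, the poison message is parked and the lag drains to zero.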

[Interactive diagram: Partition 1 (Queue), Worker, Dead Letter Queue. Lag: 5 messages. The queue is processing normally until it hits the red poison message.]

How it works (Handling Poison Messages)

# BAD: Infinite Retries (Head-of-line blocking)
def process_queue():
    while True:
        msg = queue.peek()
        try:
            handle(msg)
            queue.ack(msg) # Removes from queue
        except Exception:
            # handle() failed, so the message is NOT acked.
            # The next iteration peeks the EXACT SAME message again, forever!
            continue

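# GOOD: Dead Letter Queue Routing
MAX_RETRIES = 3  # retries allowed after the first failed attempt

def process_queue():
    while True:
        msg = queue.peek()
        try:
            handle(msg)
            queue.ack(msg)
        except Exception:
            if msg.retries >= MAX_RETRIES:
                dlq.push(msg)   # Move it out of the way!
                queue.ack(msg)  # Unblock the main queue!
            else:
                # Record the failure; the next loop retries the same message.
                msg.retries += 1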
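The loop above leaves `queue`, `dlq`, and `handle` undefined, so here is a self-contained sketch that can actually be run. Everything in it is a stand-in: `InMemoryQueue`, `drain`, the toy `handle`, and `MAX_RETRIES` are hypothetical names, not the API of Kafka, RabbitMQ, or any other broker. One deliberate difference is that the retry count lives in a dictionary on the consumer side instead of on the message object. With a real broker, a redelivered message usually arrives as a freshly deserialized object, so a counter stored on the in-memory object would likely be lost; real systems tend to track attempts in message headers or an external store.

```python
from collections import deque

MAX_RETRIES = 3  # assumed retry budget after the first failed attempt


class InMemoryQueue:
    """Hypothetical stand-in for a broker client, exposing peek/ack."""

    def __init__(self, items):
        self._items = deque(items)

    def peek(self):
        # Return the head without removing it, or None when empty.
        return self._items[0] if self._items else None

    def ack(self, msg):
        # Remove the head; only valid for the message peek() just returned.
        assert self._items and self._items[0] is msg
        self._items.popleft()


def handle(msg):
    # Toy handler: a payload without an "id" stands in for a corrupted message.
    if "id" not in msg:
        raise ValueError("malformed payload")
    return msg["id"]


def drain(queue, dlq):
    """Process until the queue is empty, parking poison messages in dlq."""
    retries = {}  # keyed by object identity, which is enough for this toy queue
    handled = []
    while (msg := queue.peek()) is not None:
        try:
            handled.append(handle(msg))
            queue.ack(msg)
        except Exception as exc:
            key = id(msg)
            if retries.get(key, 0) >= MAX_RETRIES:
                # Keep the error next to the payload for later inspection.
                dlq.append({"message": msg, "error": repr(exc)})
                queue.ack(msg)  # unblock the main queue
            else:
                retries[key] = retries.get(key, 0) + 1
    return handled


q = InMemoryQueue([{"id": 1}, {"oops": True}, {"id": 2}])
dlq = []
print(drain(q, dlq))  # [1, 2]
print(len(dlq))       # 1
```

The malformed payload fails once, is retried three times, and is then parked together with the error that caused it, so the message behind it is processed normally.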