On-call · Medium · oc-g159

Subject: Dead letter overflow · Level: Mid–Senior · ~30 min · Common in: Reliability & on-call interviews · Industries: Technology, Software development

Question

At 09:00 an alert fires: `email-send-dlq` (an SQS dead-letter queue) has crossed 50k messages, and its `ApproximateNumberOfMessagesVisible` is growing by ~400/min. The main `email-send` queue looks healthy (low depth, low message age). Its redrive policy sets `maxReceiveCount` to 5. Recent context: yesterday a vendor changed their email API's auth behavior, and calls now return HTTP 401 for messages that include a legacy `from` domain (about 8% of sends). The consumer treats any non-2xx response as a retryable failure. CloudWatch shows the consumer's error rate at a steady ~8%. How do you triage and stop the DLQ bleed, and what's the cleanup plan?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
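One way to make the mitigation and cleanup concrete is sketched below in Python. It is a minimal illustration under stated assumptions, not the consumer's actual code: the `Action` enum, the set of retryable status codes, the JSON message shape with a `from` field, and the `legacy.example.com` domain are all hypothetical. The first function stops treating a permanent failure such as HTTP 401 as retryable; the second splits DLQ message bodies into those safe to redrive and those to hold until the legacy-domain problem is fixed.

```python
import json
from enum import Enum


class Action(Enum):
    ACK = "ack"      # success: delete the message from the queue
    RETRY = "retry"  # transient failure: leave the message for redelivery
    PARK = "park"    # permanent failure: do not retry, record for cleanup


# Assumption: statuses treated as transient. Everything else that is
# non-2xx (including 401) is treated as permanent in this sketch.
RETRYABLE_STATUSES = frozenset({408, 429, 500, 502, 503, 504})

# Hypothetical legacy sender domain, for illustration only.
LEGACY_FROM_DOMAINS = frozenset({"legacy.example.com"})


def classify(status: int) -> Action:
    """Map a vendor HTTP status to a queue action."""
    if 200 <= status < 300:
        return Action.ACK
    if status in RETRYABLE_STATUSES:
        return Action.RETRY
    return Action.PARK


def should_hold(body: str) -> bool:
    """Return True if a DLQ message should NOT be redriven yet.

    Holds messages from a legacy domain and anything that cannot be parsed,
    assuming a JSON body with a string "from" field.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True
    if not isinstance(payload, dict):
        return True
    sender = payload.get("from")
    if not isinstance(sender, str):
        return True
    domain = sender.rpartition("@")[2].lower()
    return domain in LEGACY_FROM_DOMAINS


def partition_for_redrive(bodies):
    """Split DLQ message bodies into (hold, redrive) lists."""
    hold, redrive = [], []
    for body in bodies:
        (hold if should_hold(body) else redrive).append(body)
    return hold, redrive
```

Holding unparseable messages rather than redriving them is a conservative choice here: redriving messages that will fail again would simply refill the DLQ.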
