Question
At 09:00 an alert fires: `email-send-dlq` (an SQS dead-letter queue) has crossed 50k messages, and its `ApproximateNumberOfMessagesVisible` is growing at ~400/min. The main `email-send` queue looks healthy (low depth, low message age). The consumer's `maxReceiveCount` is 5. Recent context: yesterday a vendor changed their email API's auth response, so calls now return HTTP 401 for messages that include a legacy `from` domain (about 8% of sends), and the consumer treats any non-2xx response as a retryable failure. CloudWatch shows the consumer's error rate at a steady ~8%. How do you triage and stop the DLQ bleed, and what's the cleanup plan?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
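One possible mitigation, sketched here under stated assumptions rather than as the consumer's actual code: stop treating every non-2xx as retryable, so that a deterministic 401 is parked after one attempt instead of being received `maxReceiveCount` times and then moved to the DLQ. The names `Disposition`, `classify`, and `RETRYABLE_STATUSES` are hypothetical, and the choice of which statuses count as transient is an assumption to confirm against the vendor's API documentation. The second helper only restates the arithmetic from the question, assuming each dead-lettered message was attempted exactly `maxReceiveCount` times.

```python
from enum import Enum


class Disposition(Enum):
    ACK = "ack"      # success: delete the message from the queue
    RETRY = "retry"  # transient failure: leave it for SQS to redeliver
    PARK = "park"    # permanent failure: record it and delete, no more retries


# Assumption: these statuses are transient; any other 4xx is treated as permanent.
RETRYABLE_STATUSES = {408, 425, 429}


def classify(status: int) -> Disposition:
    """Map a vendor HTTP status to what the consumer should do with the message."""
    if 200 <= status < 300:
        return Disposition.ACK
    if status in RETRYABLE_STATUSES or 500 <= status < 600:
        return Disposition.RETRY
    # Includes the 401 the vendor now returns for legacy `from` domains.
    return Disposition.PARK


def wasted_calls_per_min(dlq_growth_per_min: int, max_receive_count: int) -> int:
    """Vendor calls spent per minute on messages that end up in the DLQ."""
    return dlq_growth_per_min * max_receive_count


if __name__ == "__main__":
    print(classify(401))                 # Disposition.PARK
    print(wasted_calls_per_min(400, 5))  # 2000
```

With the numbers from the question, roughly 2,000 vendor calls per minute are being spent on sends that cannot succeed. Parking those messages somewhere inspectable (rather than dropping them) keeps them available for the cleanup step once the `from` domain issue is resolved.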