Question
An SQS-driven video-transcoding worker pool alerts at 03:00: the same jobs are being processed by *multiple* workers simultaneously, transcodes are running 2–3x, and S3 shows duplicate output objects. Dashboards: `NumberOfMessagesReceived` is ~2.5x `NumberOfMessagesDeleted`; the queue's `ApproximateAgeOfOldestMessage` is fine. The queue's `VisibilityTimeout` is 30s. Recent context: a new codec was rolled out yesterday that pushed average transcode time from ~20s to ~110s. There's no DLQ configured and `maxReceiveCount` is high. Triage and mitigate this redelivery storm.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.