On-call · Medium · oc-g167

Subject: Redelivery storm
Level: Mid–Senior · ~30 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

An SQS-driven video-transcoding worker pool alerts at 03:00: the same jobs are being processed by *multiple* workers simultaneously, each job is being transcoded 2–3 times, and S3 shows duplicate output objects. Dashboards: `NumberOfMessagesReceived` is ~2.5x `NumberOfMessagesDeleted`; the queue's `ApproximateAgeOfOldestMessage` looks normal. The queue's `VisibilityTimeout` is 30s. Recent context: a new codec rolled out yesterday pushed average transcode time from ~20s to ~110s. There's no DLQ configured and `maxReceiveCount` is high. Triage and mitigate this redelivery storm.
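The numbers in the prompt can be sanity-checked with a little arithmetic. The sketch below is a rough upper bound, not a model of SQS internals: it assumes idle workers are always polling, that nobody extends the visibility timeout, and that a message becomes receivable again each time the timeout lapses. The function name `max_receives_per_job` is hypothetical.

```python
import math


def max_receives_per_job(processing_s: float, visibility_timeout_s: float) -> int:
    """Upper bound on how many times one message can be received before the
    first worker finishes, assuming the timeout is never extended."""
    if processing_s <= 0 or visibility_timeout_s <= 0:
        raise ValueError("both durations must be positive")
    # The message reappears every visibility_timeout_s seconds until it is deleted.
    return max(1, math.ceil(processing_s / visibility_timeout_s))


# Old codec: ~20s of work fits inside the 30s timeout, so one receive per job.
print(max_receives_per_job(20, 30))   # 1
# New codec: ~110s of work spans several 30s windows.
print(max_receives_per_job(110, 30))  # 4
```

Under these assumptions a job can be handed out up to four times, so an observed receive-to-delete ratio of about 2.5x is consistent with the timeout lapsing mid-transcode rather than with workers crashing.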

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
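For this scenario, "stop the bleeding" could mean raising the queue's visibility timeout above the new worst-case transcode time, with a per-message heartbeat as the follow-up fix. The sketch below is one possible shape, not a definitive implementation. It assumes `sqs` is a boto3 SQS client (for example `boto3.client("sqs")`) and uses the `set_queue_attributes`, `change_message_visibility`, and `delete_message` calls; the helper names, the 120s extension, and the 60s interval are illustrative choices.

```python
import threading
from typing import Any, Callable

MAX_VISIBILITY_S = 43200  # SQS caps the visibility timeout at 12 hours


def raise_queue_visibility_timeout(sqs: Any, queue_url: str, timeout_s: int) -> None:
    """Immediate mitigation: set the queue default above worst-case job time."""
    if not 0 <= timeout_s <= MAX_VISIBILITY_S:
        raise ValueError("timeout_s must be between 0 and 43200 seconds")
    # Queue attribute values are passed as strings.
    sqs.set_queue_attributes(
        QueueUrl=queue_url, Attributes={"VisibilityTimeout": str(timeout_s)}
    )


def process_with_heartbeat(
    sqs: Any,
    queue_url: str,
    receipt_handle: str,
    work: Callable[[], None],
    extend_s: int = 120,
    interval_s: float = 60.0,
) -> None:
    """Keep the message invisible while work() runs; delete it only on success."""
    if interval_s >= extend_s:
        raise ValueError("interval_s must be shorter than extend_s")

    def extend() -> None:
        sqs.change_message_visibility(
            QueueUrl=queue_url, ReceiptHandle=receipt_handle, VisibilityTimeout=extend_s
        )

    done = threading.Event()

    def heartbeat() -> None:
        # Event.wait returns False on timeout and True once done is set.
        while not done.wait(interval_s):
            extend()

    extend()  # extend right away so a short queue default cannot lapse first
    beat = threading.Thread(target=heartbeat, daemon=True)
    beat.start()
    try:
        work()
    finally:
        done.set()
        beat.join()
    # Reached only if work() did not raise; a failed job is left for redelivery.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
```

Neither step makes processing idempotent. Standard queues deliver at least once, so deterministic output keys or a "job already done" check would still be needed to stop duplicate S3 objects, and a DLQ with a sane `maxReceiveCount` bounds how long a poison message can circulate.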

Learn the concepts

Diagram & narrate the incident

