On-call · Medium · oc-g212

Subject: Retry storm
Level: Mid–Senior
~30 min
Common in: Algorithms & data structures interviews
Industries: Technology, Software development

Question

At 10:40, throughput on your SQS-backed worker fleet collapses and queue depth climbs into the millions. The dashboards show that a single message type started failing at 10:35. Consumers retry failed messages, and because both visibility-timeout redelivery and your app-level retry fire, the same messages cycle endlessly. Worker CPU is high, but useful work done (successful completions) is near zero. The DLQ is empty because maxReceiveCount is set very high. A schema change shipped at 10:34. How do you triage and mitigate?
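To see why the two retry layers are so damaging together, here is a rough back-of-the-envelope model. It is only a sketch: the function names and every number used with it are hypothetical, and it assumes each receive runs one initial attempt plus a fixed number of in-process retries, and that a failing attempt costs about as much as a successful one.

```python
def attempts_per_message(max_receive_count: int, app_retries: int) -> int:
    """Handler invocations spent on one permanently failing message
    before the queue finally moves it to the DLQ.

    Assumption: every receive runs 1 initial attempt plus `app_retries`
    in-process retries, and the message is redelivered until it has been
    received `max_receive_count` times.
    """
    return max_receive_count * (1 + app_retries)


def wasted_capacity(failing_share: float, attempts_per_bad: int) -> float:
    """Fraction of handler invocations spent on messages that can never succeed.

    `failing_share` is the share of incoming messages that fail permanently.
    A healthy message costs 1 attempt; a failing one costs `attempts_per_bad`.
    """
    bad = failing_share * attempts_per_bad
    good = 1.0 - failing_share
    return bad / (bad + good)


if __name__ == "__main__":
    # Hypothetical settings: maxReceiveCount = 1000, 3 app-level retries.
    n = attempts_per_message(1000, 3)
    print(n, wasted_capacity(0.05, n))
```

Under these assumed settings, even a small share of failing messages consumes nearly all worker capacity, which matches the high-CPU, near-zero-completions picture in the question.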

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
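One possible prevention step to mention at the end of an answer is a consumer that bounds retries and quarantines poison messages instead of letting them cycle. The following is a minimal in-memory sketch, not an SQS client: `Message`, `process`, `handle`, `PermanentError`, the `schema_version` field, and the `MAX_RECEIVES` budget are all hypothetical names chosen for illustration. With real SQS, the receive count would typically come from the `ApproximateReceiveCount` message attribute, and quarantine would be handled by a low `maxReceiveCount` in the redrive policy.

```python
from dataclasses import dataclass

MAX_RECEIVES = 3  # hypothetical retry budget; tune for the workload


class PermanentError(Exception):
    """Failure that retrying cannot fix (for example, an unknown schema)."""


@dataclass
class Message:
    body: dict
    receive_count: int = 0  # stands in for SQS ApproximateReceiveCount


def process(body: dict) -> None:
    """Toy handler: rejects unknown schemas, simulates a transient fault."""
    if body.get("schema_version") != 1:
        raise PermanentError("unsupported schema_version")
    if body.get("flaky"):
        raise TimeoutError("simulated transient failure")


def handle(msg: Message, dead_letters: list) -> str:
    """Process one delivery and decide what happens to the message."""
    msg.receive_count += 1
    try:
        process(msg.body)
        return "done"
    except PermanentError:
        # Retrying cannot help, so quarantine on the first failure.
        dead_letters.append(msg)
        return "quarantined"
    except Exception:
        # Possibly transient: retry, but only within the budget.
        if msg.receive_count >= MAX_RECEIVES:
            dead_letters.append(msg)
            return "quarantined"
        return "retry"
```

The design choice worth narrating is the split between permanent and transient failures: a message that fails schema validation goes straight to quarantine, while everything else gets a small, fixed number of attempts from a single retry layer.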
