On-call · Hard · oc-g389

Subject: Queue incidents
Level: Senior–Staff · ~40 min
Common in: Reliability & on-call · Algorithms & data structures interviews
Industries: Technology

Question

An SQS FIFO queue `inventory-commands` (`MessageGroupId = warehouse_id`) processes per-warehouse commands in order. At 12:40 overall throughput collapses: `ApproximateNumberOfMessagesVisible` climbs to 70k and the age of the oldest message rises, but the worker fleet is mostly idle, with few messages in flight. Dashboards show that most active message groups are draining, but one `warehouse_id` (the new mega-DC onboarded today) has thousands of queued commands. SQS FIFO delivers in order *within* a group and won't release the next message in a group until the current one is deleted (acked) or its visibility timeout expires. There are no errors and no DLQ growth. How do you triage and mitigate?
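To make the delivery rule in the question concrete, here is a toy Python model. It is not the SQS API: `simulate_fifo`, the one-tick service time, and the worker count are all assumptions chosen for illustration. It only captures the stated rule that a group with a message in flight cannot hand out its next message.

```python
from collections import deque


def simulate_fifo(arrivals, num_workers, ticks):
    """Toy model of per-group FIFO delivery (assumption: every message takes one tick).

    arrivals: group ids in arrival order, one entry per queued message.
    Returns (processed_per_group, peak_busy_workers, remaining_backlog).
    """
    pending = deque(arrivals)
    processed = {}
    peak_busy = 0
    for _ in range(ticks):
        in_flight = []   # groups being worked on during this tick
        locked = set()   # groups that already have a message in flight
        skipped = []     # messages passed over because their group is locked
        while pending and len(in_flight) < num_workers:
            group = pending.popleft()
            if group in locked:
                skipped.append(group)
            else:
                locked.add(group)
                in_flight.append(group)
        # Put passed-over messages back at the head, preserving arrival order.
        pending.extendleft(reversed(skipped))
        for group in in_flight:
            processed[group] = processed.get(group, 0) + 1
        peak_busy = max(peak_busy, len(in_flight))
    return processed, peak_busy, len(pending)


# One hot group with 1000 commands, three small groups, eight workers.
arrivals = ["mega"] * 1000 + ["w1", "w2", "w3"]
processed, peak_busy, backlog = simulate_fifo(arrivals, num_workers=8, ticks=10)
print(processed, peak_busy, backlog)
```

In this model the small groups drain in the first tick, after which only one of the eight workers is ever busy while 990 messages remain queued. That matches the symptom pattern in the question: a large visible backlog alongside an idle fleet, because a single group's parallelism is capped at one.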

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
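One longer-term fix a candidate might propose is a finer-grained message group key, so a single large warehouse no longer serializes onto one group. The sketch below is hypothetical: it assumes commands only need to be ordered per SKU within a warehouse, which the question does not state, and the helper name `sharded_group_id`, the `sku` field, and the shard count are illustrative choices rather than anything from the scenario.

```python
import hashlib


def sharded_group_id(warehouse_id: str, sku: str, shards: int = 16) -> str:
    """Build a MessageGroupId that spreads one warehouse across `shards` groups.

    Assumption: ordering is only required per (warehouse, SKU), so all commands
    for the same SKU must land in the same group. A stable hash guarantees that.
    """
    digest = hashlib.sha256(sku.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shards
    return f"{warehouse_id}#{shard}"


# The same SKU always maps to the same group; different SKUs spread out.
print(sharded_group_id("wh-mega", "sku-123"))
print(sharded_group_id("wh-mega", "sku-456"))
```

With a key like this, the hot warehouse can be processed by up to `shards` workers at once instead of one. If commands must be strictly ordered across the whole warehouse, this change is not safe, and the discussion should move to other options such as isolating the hot warehouse on its own queue.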
