Question
An SQS FIFO queue `inventory-commands` (`MessageGroupId = warehouse_id`) processes per-warehouse commands in order. At 12:40 overall throughput collapses: `ApproximateNumberOfMessagesVisible` climbs to 70k and the age of the oldest message (`ApproximateAgeOfOldestMessage`) rises, but the worker fleet is mostly idle, with few messages in flight. Dashboards show that most active message groups are draining, but one `warehouse_id` (the new mega-DC onboarded today) has thousands of queued commands. SQS FIFO delivers in order *within* a group and won't release the next message in a group until the current one is deleted or its visibility timeout expires. There are no errors and no DLQ growth. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
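To make the symptom concrete (large visible backlog, idle workers), here is a minimal sketch of per-group ordering in plain Python. It is a toy model and not the SQS API: it assumes at most one in-flight message per group and that every message takes one tick to process. The function names, the warehouse IDs, and the tick model are all hypothetical.

```python
from collections import deque


def max_parallelism(backlog):
    """Upper bound on busy workers under the one-in-flight-per-group assumption."""
    return len({group for group, _ in backlog})


def simulate(backlog, workers, ticks):
    """Return how many workers are busy on each tick (toy model, not real SQS)."""
    # Build one ordered queue per message group.
    queues = {}
    for group, body in backlog:
        queues.setdefault(group, deque()).append(body)

    busy_per_tick = []
    for _ in range(ticks):
        # Only the head of each non-empty group is eligible for delivery.
        ready = [g for g, q in queues.items() if q]
        taken = ready[:workers]
        for g in taken:
            queues[g].popleft()  # message processed and deleted this tick
        busy_per_tick.append(len(taken))
    return busy_per_tick


# Hypothetical backlog: three small warehouses and one hot "mega-DC" group.
backlog = (
    [("wh-a", "cmd")] * 2
    + [("wh-b", "cmd")] * 2
    + [("wh-c", "cmd")] * 2
    + [("mega-dc", "cmd")] * 20
)

busy = simulate(backlog, workers=8, ticks=5)
ceiling = max_parallelism(backlog)
```

With 8 workers, `busy` comes out as `[4, 4, 1, 1, 1]`: once the small groups drain, only the hot group has work and at most one worker can touch it, so the backlog stays high while the fleet sits idle. Under this model, adding workers cannot help; the concurrency ceiling is the number of distinct groups with pending messages.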