Question
An SQS FIFO queue `account-commands` processes per-account commands (`MessageGroupId = account_id`) and guarantees in-order processing per account. At 10:00, overall throughput collapses: `ApproximateNumberOfMessagesVisible` climbs to 90k, but the worker fleet is mostly *idle* (low CPU, few messages in flight). Dashboards show one message group is stuck: its handler keeps failing and the message is retried repeatedly (in flight, then visible again after each failure), and `maxReceiveCount` is set so high that the message never reaches the dead-letter queue (DLQ). Recent context: one specific account triggered a command that hits a code path with a null-pointer bug deployed this morning. Explain why the *whole* queue's throughput collapsed and how you would mitigate it.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
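
To make the `maxReceiveCount` detail in the scenario concrete, here is a minimal Python sketch. It builds a tighter redrive policy and flags a message that has already been received too many times. It illustrates one possible mitigation aid, not the full answer. The DLQ ARN, the threshold of 3, and the helper names `build_redrive_attributes` and `should_quarantine` are assumptions for illustration. The attribute names (`RedrivePolicy`, `deadLetterTargetArn`, `maxReceiveCount`, `ApproximateReceiveCount`) are the ones SQS uses.

```python
import json

# Hypothetical values for illustration; neither appears in the scenario.
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:account-commands-dlq.fifo"
MAX_RECEIVE_COUNT = 3


def build_redrive_attributes(dlq_arn: str, max_receive_count: int) -> dict:
    """Build the Attributes mapping that a SetQueueAttributes call expects."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the redrive policy as a JSON-encoded string.
    return {"RedrivePolicy": json.dumps(policy)}


def should_quarantine(message: dict, threshold: int) -> bool:
    """Return True when a received message has hit the retry threshold.

    `message` is assumed to be shaped like one entry of a ReceiveMessage
    response, with ApproximateReceiveCount requested as a system attribute.
    A missing attribute is treated as a first delivery.
    """
    attributes = message.get("Attributes", {})
    receive_count = int(attributes.get("ApproximateReceiveCount", "1"))
    return receive_count >= threshold


if __name__ == "__main__":
    print(build_redrive_attributes(DLQ_ARN, MAX_RECEIVE_COUNT))
    poison = {"Attributes": {"ApproximateReceiveCount": "12"}}
    print(should_quarantine(poison, MAX_RECEIVE_COUNT))
```

With boto3, the returned mapping could be passed as `Attributes` to `set_queue_attributes` for the source queue. `ApproximateReceiveCount` is only present when the consumer requests it on receive, so the second helper assumes the worker already does so.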