Question
An order-fulfillment service uses RabbitMQ with a delayed-retry design: the `fulfillment` work queue dead-letters (via DLX) to `fulfillment.wait` which has a per-message TTL of 60s and, on TTL expiry, dead-letters *back* to `fulfillment` to be retried. There is no explicit attempt counter. At 09:10 broker CPU spikes to ~90%, network throughput triples, and the `fulfillment.wait` queue depth oscillates around 80k while `fulfillment` ready-count flickers. The main consumers are healthy and acking normally; effective *useful* throughput has actually dropped. Recent context: a config push 30 min ago made the `inventory` gRPC call time out for a specific SKU family that's ~5% of orders. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.