Question
A RabbitMQ classic queue `image-thumbnail` is consumed by 10 worker pods. At 15:30 queue depth (`ready`) climbs from ~200 to 60k and keeps rising. Dashboards: total consumer throughput dropped to roughly *one tenth* of normal even though all 10 pods are up and connected. Per-pod metrics show ONE pod at high CPU pulling a large `unacked` count pinned at its prefetch (which was recently raised to 5000), while the other 9 pods sit at near-zero unacked and almost idle. No errors, no redeliveries. Recent context: yesterday someone raised `prefetch` (QoS) from 50 to 5000 to 'improve throughput.' How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.