On-call · Medium · oc-g383

Subject: Consumer lag · Level: Mid–Senior · ~35 min · Common in Reliability & on-call interviews · Industries: Technology

Question

A RabbitMQ classic queue `image-thumbnail` is consumed by 10 worker pods. At 15:30, queue depth (`ready`) climbs from ~200 to 60k and keeps rising. Dashboards show total consumer throughput at roughly *one tenth* of normal even though all 10 pods are up and connected. Per-pod metrics show ONE pod at high CPU with its `unacked` count pinned at its prefetch limit of 5000, while the other 9 pods sit at near-zero unacked and are almost idle. There are no errors and no redeliveries. Recent context: yesterday someone raised `prefetch` (QoS) from 50 to 5000 to "improve throughput." How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
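One way to turn "form hypotheses from real signals" into something concrete is a small check over per-consumer metrics. This is a minimal sketch, not a RabbitMQ API: the `ConsumerStats` shape, the `find_prefetch_hogs` helper, the pod names, and both thresholds are hypothetical, and the snapshot values are taken from the scenario in the question.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConsumerStats:
    """Per-consumer snapshot; field names are hypothetical, not a RabbitMQ API."""
    pod: str
    prefetch: int  # configured QoS prefetch for the consumer's channel
    unacked: int   # deliveries currently held without an ack


def find_prefetch_hogs(stats: List[ConsumerStats],
                       pinned_ratio: float = 0.9,
                       idle_ratio: float = 0.01) -> List[str]:
    """Return pods pinned near their prefetch while at least half the fleet is idle."""
    pinned = [s for s in stats
              if s.prefetch > 0 and s.unacked >= pinned_ratio * s.prefetch]
    idle = [s for s in stats
            if s.prefetch > 0 and s.unacked <= idle_ratio * s.prefetch]
    # Flag skew only when a pinned consumer coexists with a mostly idle fleet.
    if pinned and len(idle) >= len(stats) // 2:
        return [s.pod for s in pinned]
    return []


# Snapshot mirroring the question: one pod at 5000 unacked, nine nearly idle.
snapshot = [ConsumerStats("worker-0", 5000, 5000)] + [
    ConsumerStats(f"worker-{i}", 5000, 3) for i in range(1, 10)
]
hogs = find_prefetch_hogs(snapshot)
print(hogs)  # ['worker-0']
```

A non-empty result points at delivery skew rather than a broker or downstream failure, which supports reverting the prefetch change as the first mitigation to consider before deeper root-cause work.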
