On-callHardoc-g165

Subject Backlog buildupLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

RabbitMQ single-node broker (classic queues). At 15:40 producers start reporting blocked publishes and rising publish latency. The broker dashboard shows `mem_used` at 0.95 of `vm_memory_high_watermark`, the node's `connection.blocked` count spiking, and the `orders` queue depth at 2.1M and climbing (messages are not `persistent`; consumers were already lagging). The management UI shows the broker in a memory alarm state. Recent context: the main `orders` consumer group was scaled *down* from 12 to 3 workers in a cost-cutting change yesterday, and inbound order volume is at a normal daily peak. Triage and mitigate before the broker falls over.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.