On-callMediumoc-g155

Subject Consumer lagLevel Mid–Senior~30 minCommon in Distributed systems interviewsIndustries Technology, Software development

Question

PagerDuty fires at 02:14: the `orders-enrichment` Kafka consumer group is alerting on consumer lag. Dashboards: end-to-end lag on the `orders` topic (12 partitions) has climbed from a steady ~2k to 1.8M records over 40 minutes and is still rising linearly. Broker-side produce rate is flat at its usual ~9k msg/s. Consumer group has 4 instances; CPU on them is ~30%, and the consumer's `records-consumed-rate` JMX metric dropped to near zero ~35 minutes ago. The group's `rebalance-rate` shows a spike right before the drop. A deploy of the enrichment service went out 45 minutes ago. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.