On-call · Hard · oc-g376

Subject: Consumer lag · Level: Senior–Staff · ~40 min · Common in: Reliability & on-call interviews · Industries: Technology

Question

Starting at 03:50, the `notifications` Kafka consumer group (24 partitions, 12 instances) enters a state where lag oscillates wildly (climbing to 2M, dropping to 200k, climbing again) and never stabilizes. On the dashboards, the group's `rebalance-rate-per-hour` JMX metric is pinned high, and instances log `Member X sending LeaveGroup ... consumer poll timeout has expired` and `Attempt to heartbeat failed ... group is rebalancing` in a loop. CPU is moderate and there are no OOMs. Recent context: yesterday a new feature added a per-message call to a templating service whose p99 is ~600ms; message volume is normal. Consumer settings: `max.poll.records=500`, `max.poll.interval.ms=300000` (default 5 min), `session.timeout.ms=10000`. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
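One way to turn the signals in the question into a testable hypothesis is to check the poll-loop time budget. The sketch below is a back-of-the-envelope calculation, not a definitive diagnosis: it assumes each record is processed sequentially on the poll thread and that every templating call costs the p99 (a deliberately pessimistic bound). The function names and the 50% headroom factor are illustrative assumptions, not values from the question.

```python
import math

# Values taken from the question.
MAX_POLL_RECORDS = 500
MAX_POLL_INTERVAL_MS = 300_000
TEMPLATING_P99_MS = 600


def worst_case_batch_ms(max_poll_records: int, per_record_ms: float) -> float:
    """Time to process one full batch if records are handled one at a time
    and every record costs per_record_ms (pessimistic assumption)."""
    return max_poll_records * per_record_ms


def safe_max_poll_records(
    max_poll_interval_ms: int, per_record_ms: float, headroom: float = 0.5
) -> int:
    """Largest batch size whose worst-case processing time fits within
    headroom * max.poll.interval.ms. The headroom value is an assumption."""
    return max(1, math.floor(max_poll_interval_ms * headroom / per_record_ms))


batch_ms = worst_case_batch_ms(MAX_POLL_RECORDS, TEMPLATING_P99_MS)
exceeds_budget = batch_ms >= MAX_POLL_INTERVAL_MS
suggested_records = safe_max_poll_records(MAX_POLL_INTERVAL_MS, TEMPLATING_P99_MS)

print(f"worst-case batch time: {batch_ms:.0f} ms")
print(f"max.poll.interval.ms:  {MAX_POLL_INTERVAL_MS} ms")
print(f"budget exhausted:      {exceeds_budget}")
print(f"batch size with 50% headroom: {suggested_records}")
```

Under these assumptions a full batch of 500 records at 600 ms each takes 300,000 ms, which is the entire `max.poll.interval.ms` budget with no margin. That is consistent with the `consumer poll timeout has expired` log line, and it suggests that lowering `max.poll.records` or raising `max.poll.interval.ms` is a plausible first mitigation while the per-message templating call is addressed separately.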

Diagram & narrate the incident

Run or narrate your approach, then ask the coach.