On-callHardoc-g381

Subject Queue incidentsLevel Senior–Staff~40 minCommon in Reliability & on-call · Algorithms & data structures interviewsIndustries Technology

Question

At 02:30 the `account-activity` Kafka consumer group's lag suddenly jumps from ~0 to the *entire retention* of the topic (≈2 billion records across 48 partitions) and the consumers start churning at full throttle, re-emitting events downstream. Support pages in: users are getting old notifications resent and a few derived counters look doubled. Dashboards: no broker errors, produce rate normal; the group's committed offsets dropped to the log-start offset on all partitions at 02:30. Recent context: an ops engineer ran an incident-recovery script that recreated the consumer group's `__consumer_offsets` after a separate tooling problem; the consumers run with `auto.offset.reset=earliest`. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.