Question
At 02:30 the `account-activity` Kafka consumer group's lag suddenly jumps from ~0 to the topic's *entire retention* (≈2 billion records across 48 partitions), and the consumers start churning at full throttle, re-emitting events downstream. Support pages in: users are getting old notifications resent, and a few derived counters look doubled. Dashboards: no broker errors, produce rate normal; the group's committed offsets dropped to the log-start offset on all partitions at 02:30. Recent context: an ops engineer ran an incident-recovery script that recreated the consumer group's entries in `__consumer_offsets` after a separate tooling problem; the consumers run with `auto.offset.reset=earliest`. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
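One way to turn the "committed offsets dropped to the log-start offset" signal into something actionable is sketched below. It is only an illustration under stated assumptions: it assumes a pre-incident snapshot of committed offsets exists (for example from lag monitoring), and the names `PartitionState`, `regressed_partitions`, `restore_plan`, and `replay_exposure` are hypothetical helpers, not part of any Kafka client. It talks to no broker; it only compares numbers you would have gathered separately.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PartitionState:
    log_start: int         # earliest retained offset on the partition
    log_end: int           # next offset to be written
    committed_before: int  # committed offset from a pre-incident snapshot (assumed available)
    committed_now: int     # committed offset observed during the incident


def regressed_partitions(states: Dict[int, PartitionState]) -> List[int]:
    """Partitions whose committed offset moved backwards since the snapshot."""
    return sorted(p for p, s in states.items() if s.committed_now < s.committed_before)


def restore_plan(states: Dict[int, PartitionState]) -> Dict[int, int]:
    """Target offset per regressed partition: the snapshot value, clamped to the retained range."""
    plan = {}
    for p in regressed_partitions(states):
        s = states[p]
        plan[p] = min(max(s.committed_before, s.log_start), s.log_end)
    return plan


def replay_exposure(states: Dict[int, PartitionState]) -> int:
    """Worst-case count of already-processed records that a rewind to log start replays."""
    return sum(
        max(states[p].committed_before - states[p].log_start, 0)
        for p in regressed_partitions(states)
    )


# Illustrative numbers only: partition 0 was rewound, partition 1 advanced normally.
example = {
    0: PartitionState(log_start=100, log_end=5000, committed_before=4990, committed_now=100),
    1: PartitionState(log_start=0, log_end=3000, committed_before=2995, committed_now=2998),
}
plan = restore_plan(example)
exposure = replay_exposure(example)
```

A plan like this is only safe to apply once the consumers are stopped, since offsets for a group with active members generally cannot be reset. Restoring to a snapshot taken slightly before 02:30 still replays whatever was processed between the snapshot and the rewind, so downstream consumers should be treated as at-least-once for that window. The stock `kafka-consumer-groups` tool's `--reset-offsets` option, with a dry run first, is one plausible way to apply the targets.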