On-callHardoc-g392

Subject Redelivery stormLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Technology

Question

A `loyalty-points` Kafka consumer (manual offset commit, 12 partitions, 6 instances) calls an external points API per record, then commits offsets in a batch every 30s. At 23:10 customers report points credited multiple times; finance sees inflated balances. Dashboards: no lag, no consumer crashes, throughput looks normal. But the consumer group's `commit-rate` JMX metric has dropped to near zero on several instances since 23:05, while records continue to be processed. Logs show occasional `CommitFailedException: ... rebalance` warnings the handler catches and ignores. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.