Question
A `loyalty-points` Kafka consumer (manual offset commit, 12 partitions, 6 instances) calls an external points API per record, then commits offsets in a batch every 30s. At 23:10 customers report points credited multiple times; finance sees inflated balances. Dashboards: no lag, no consumer crashes, throughput looks normal. But the consumer group's `commit-rate` JMX metric has dropped to near zero on several instances since 23:05, while records continue to be processed. Logs show occasional `CommitFailedException: ... rebalance` warnings the handler catches and ignores. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
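One candidate mitigation for the duplicate credits is to make the points API call idempotent. The minimal sketch below is a simulation only and uses no Kafka client. `PointsLedger`, `credit`, `idempotency_key`, the topic name, and the customer ID are hypothetical stand-ins, and it assumes the external API can accept and deduplicate on a caller-supplied key. It replays one batch twice, as would happen when a commit fails during a rebalance and the new partition owner re-reads from the last committed offset.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    topic: str
    partition: int
    offset: int
    customer_id: str
    points: int


@dataclass
class PointsLedger:
    """Hypothetical stand-in for the external points API."""
    balances: dict = field(default_factory=dict)
    seen_keys: set = field(default_factory=set)

    def credit(self, customer_id, points, idempotency_key=None):
        # With a key, a repeated delivery of the same record is a no-op.
        if idempotency_key is not None:
            if idempotency_key in self.seen_keys:
                return False
            self.seen_keys.add(idempotency_key)
        self.balances[customer_id] = self.balances.get(customer_id, 0) + points
        return True


def idempotency_key(record):
    # Topic, partition, and offset together identify one record in a cluster.
    return f"{record.topic}:{record.partition}:{record.offset}"


def replay(records, ledger, use_key):
    # Credit every record, optionally passing a deduplication key.
    for r in records:
        key = idempotency_key(r) if use_key else None
        ledger.credit(r.customer_id, r.points, key)


# Assumed topic name and customer ID, for illustration only.
batch = [Record("loyalty-points", 3, o, "cust-42", 10) for o in range(5)]

# Same batch delivered twice: once before the failed commit, once after
# the rebalance hands the partition to another instance.
naive = PointsLedger()
replay(batch, naive, use_key=False)
replay(batch, naive, use_key=False)

safe = PointsLedger()
replay(batch, safe, use_key=True)
replay(batch, safe, use_key=True)

print(naive.balances["cust-42"])  # 100: every record credited twice
print(safe.balances["cust-42"])   # 50: redelivery ignored
```

Idempotency limits the damage from redelivery but does not explain why commits are failing. The swallowed `CommitFailedException` and the collapsed `commit-rate` still need their own investigation.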