Question
A Kafka consumer for the `fraud-scoring` topic (read_committed, 16 partitions, 8 instances) feeds an external scoring API and produces results to an output topic. At 22:30 the output topic shows a flood of *duplicate* score results, the input consumer group's `commit-rate` JMX metric drops to near zero, and `rebalance-rate` is firing continuously; lag on the input topic is rising even though the instances are busy at ~80% CPU. Recent context: the scoring API's latency degraded at 22:20 — a single record's `process()` now sometimes takes 45s. The consumer's `max.poll.interval.ms` is 30000 (30s) and offsets are committed only after each batch finishes. Triage and explain the duplicate-storm and the stalled commits.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.