On-callHardoc-g176

Subject Redelivery stormLevel Senior–Staff~40 minCommon in Databases & SQL · Distributed systems · Algorithms & data structures interviewsIndustries Technology, Software development

Question

A Kafka consumer for the `fraud-scoring` topic (read_committed, 16 partitions, 8 instances) feeds an external scoring API and produces results to an output topic. At 22:30 the output topic shows a flood of *duplicate* score results, the input consumer group's `commit-rate` JMX metric drops to near zero, and `rebalance-rate` is firing continuously; lag on the input topic is rising even though the instances are busy at ~80% CPU. Recent context: the scoring API's latency degraded at 22:20 — a single record's `process()` now sometimes takes 45s. The consumer's `max.poll.interval.ms` is 30000 (30s) and offsets are committed only after each batch finishes. Triage and explain the duplicate-storm and the stalled commits.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.