On-callHardoc-g341

Subject Duplicate processingLevel Senior–Staff~40 minCommon in Distributed systems interviewsIndustries Technology

Question

A Kafka consumer group processes `payment.capture` events and calls a payment gateway, then commits offsets. The team believes processing is 'exactly once'. At 11:05 an autoscaler added two pods, triggering a consumer-group rebalance. Over the next 10 minutes, ~300 captures were sent to the gateway *twice*. Dashboards: consumer lag briefly spiked then recovered, no errors, no crashes — just the rebalance. Each handler takes ~4s (a slow gateway call) before it commits. How do you triage, stop the double-captures, and reconcile the duplicates?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.