Code Room
On-callHard
Question
A Kafka consumer group processes `payment.capture` events and calls a payment gateway, then commits offsets. The team believes processing is 'exactly once'. At 11:05 an autoscaler added two pods, triggering a consumer-group rebalance. Over the next 10 minutes, ~300 captures were sent to the gateway *twice*. Dashboards: consumer lag briefly spiked then recovered, no errors, no crashes — just the rebalance. Each handler takes ~4s (a slow gateway call) before it commits. How do you triage, stop the double-captures, and reconcile the duplicates?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.