Code Room
On-call · Hard
Question
A Kafka consumer group (Java) processes an events topic with 12 partitions, preserving order within each partition. At 13:00, consumer lag on exactly one partition starts growing without bound while the other 11 partitions stay at near-zero lag, and throughput on that partition collapses. Logs from the consumer assigned to that partition show it repeatedly trying to process the same message offset, throwing, and retrying; it never commits past that offset. A producer started emitting a new event type at 12:55. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
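One plausible reading of the signals in the question is a poison-pill record: the new event type produced at 12:55 fails in the handler, and because the consumer never commits past it, the partition stalls. A minimal sketch of a common mitigation (bounded retries, then park the record and advance the offset) follows. It is plain Java with no Kafka client, so the partition, handler, and dead-letter list are simulated. `Event`, `Result`, `handle`, `drain`, `MAX_ATTEMPTS`, and the event type names are hypothetical and not taken from the system in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Simulation only: models one partition as an in-memory list of records.
class PoisonPillSketch {

    // Hypothetical stand-in for a consumed record.
    static final class Event {
        final long offset;
        final String type;

        Event(long offset, String type) {
            this.offset = offset;
            this.type = type;
        }
    }

    // Outcome of draining the partition: the next offset to read, plus parked records.
    static final class Result {
        final long committedOffset;
        final List<Event> deadLetters;

        Result(long committedOffset, List<Event> deadLetters) {
            this.committedOffset = committedOffset;
            this.deadLetters = deadLetters;
        }
    }

    // Assumed retry budget per record; the right value depends on the failure mode.
    static final int MAX_ATTEMPTS = 3;

    // Hypothetical handler that only understands one event type.
    static void handle(Event e) {
        if (!"order_created".equals(e.type)) {
            throw new IllegalArgumentException("unknown event type: " + e.type);
        }
    }

    // Processes records in offset order. A record that still fails after
    // MAX_ATTEMPTS is parked in deadLetters so the offset can move past it.
    static Result drain(List<Event> partition, Consumer<Event> handler) {
        List<Event> deadLetters = new ArrayList<>();
        long committed = 0;
        for (Event e : partition) {
            boolean ok = false;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS && !ok; attempt++) {
                try {
                    handler.accept(e);
                    ok = true;
                } catch (RuntimeException ex) {
                    // Swallow and retry until the budget is exhausted.
                }
            }
            if (!ok) {
                deadLetters.add(e);
            }
            // Commit the next offset whether the record succeeded or was parked.
            committed = e.offset + 1;
        }
        return new Result(committed, deadLetters);
    }

    public static void main(String[] args) {
        List<Event> partition = Arrays.asList(
                new Event(0, "order_created"),
                new Event(1, "order_refunded"), // the new, unhandled type
                new Event(2, "order_created"));
        Result r = drain(partition, PoisonPillSketch::handle);
        System.out.println("committed offset: " + r.committedOffset);
        System.out.println("parked records: " + r.deadLetters.size());
    }
}
```

Parking a record lets the offset advance, which trades strict in-order processing for availability. Whether that is acceptable depends on whether later events on the partition depend on the parked one, so a strong answer would state that trade-off before skipping anything.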