On-call · Hard · oc-g038

Subject: Head-of-line blocking
Level: Senior–Staff
~40 min
Common in: Distributed systems interviews
Industries: Technology, Software development

Question

A Kafka consumer group (Java) processes an events topic with 12 partitions, with in-order processing per partition. At 13:00, consumer lag on exactly one partition starts growing without bound while the other 11 partitions stay at near-zero lag. Throughput on that one partition has collapsed. Logs from the consumer assigned to that partition show it repeatedly trying to process the same message offset, throwing, and retrying; it never commits past that offset. A producer started emitting a new event type at 12:55. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
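One common way to stop the bleeding in this scenario is to cap retries and park the failing record so the partition can move on. The sketch below is a stdlib-only simulation of that idea, not Kafka client code: the class `PoisonPillSketch`, the methods `handle` and `drain`, and the event type name `refund_v2` are all hypothetical, and it assumes the new event type is what the handler cannot process. Note that parking a record gives up strict ordering for that one record, which may or may not be acceptable for the real workload.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Stdlib-only simulation of one partition's consumer loop.
 * All names are hypothetical; no Kafka client API is used.
 */
public class PoisonPillSketch {

    /** Outcome of draining a partition: next offset to consume and parked offsets. */
    public static final class Result {
        public final int nextOffset;
        public final List<Integer> deadLettered;

        Result(int nextOffset, List<Integer> deadLettered) {
            this.nextOffset = nextOffset;
            this.deadLettered = deadLettered;
        }
    }

    /** Stand-in for the real handler: it only knows the old event types. */
    static void handle(String eventType) {
        if (!eventType.equals("click") && !eventType.equals("view")) {
            throw new IllegalArgumentException("unknown event type: " + eventType);
        }
    }

    /**
     * Processes records in offset order. A record that still fails after
     * maxAttempts is parked (dead-lettered) and the offset advances, so one
     * bad record cannot block everything queued behind it.
     */
    public static Result drain(List<String> partition, int maxAttempts) {
        List<Integer> deadLettered = new ArrayList<>();
        int offset = 0;
        while (offset < partition.size()) {
            boolean done = false;
            for (int attempt = 1; attempt <= maxAttempts && !done; attempt++) {
                try {
                    handle(partition.get(offset));
                    done = true;
                } catch (RuntimeException e) {
                    // A deterministic failure fails on every attempt;
                    // the loop ends once maxAttempts is reached.
                }
            }
            if (!done) {
                // A real system would publish the raw record to a dead-letter topic here.
                deadLettered.add(offset);
            }
            offset++; // "commit" past this record either way
        }
        return new Result(offset, deadLettered);
    }

    public static void main(String[] args) {
        Result r = drain(List.of("click", "view", "refund_v2", "click"), 3);
        // prints "next offset: 4, dead-lettered: [2]"
        System.out.println("next offset: " + r.nextOffset + ", dead-lettered: " + r.deadLettered);
    }
}
```

With unbounded retries the loop would never leave offset 2, which matches the symptom in the question. The longer-term fix (teaching the handler the new event type, or rejecting unknown types at the producer) is a separate step from this mitigation.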

Diagram & narrate the incident
