On-call · Hard · oc-g548

Subject: On call
Level: Mid–Senior · ~40 min
Common in: Reliability & on-call · Distributed systems · Algorithms & data structures interviews
Industries: Technology

Question

Your order-fulfillment service consumes from a Kafka topic `orders.placed` (24 partitions). At 14:20 the consumer-group lag dashboard shows lag climbing linearly across ALL partitions — it's now 1.2M messages and growing ~3k/sec, and customers are reporting orders stuck in 'pending'. The consumers are healthy (not crashed, CPU low), but the processing-rate metric for the group has dropped to nearly zero. Logs show the same record being processed over and over: each attempt throws `JsonParseException` on a record at a fixed offset on partition 7, the consumer doesn't commit, rebalances, and retries from the same offset. A producer team deployed a schema change 30 minutes before this started. How do you triage, stop the bleeding, and prevent recurrence?
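
As a rough sanity check on the numbers in the question, the backlog and growth rate can be turned into two estimates: how long the group has been stalled, and how long a drain would take after recovery. This is only a sketch. It assumes the lag growth (~3k/sec) is about equal to the produce rate because consumption is near zero, and the recovered throughput of 500 messages/sec per partition is a hypothetical figure, not something stated in the scenario.

```python
# Figures taken from the question.
backlog_msgs = 1_200_000   # current consumer-group lag
produce_rate = 3_000       # msgs/sec; assumed equal to lag growth while consumption is ~0

# With consumption near zero, lag grows at roughly the produce rate,
# so the backlog size implies how long the group has been stalled.
stalled_seconds = backlog_msgs / produce_rate


def drain_seconds(backlog: int, consume_rate: float, produce_rate: float) -> float:
    """Time to clear the backlog once consumers recover; inf if they cannot catch up."""
    surplus = consume_rate - produce_rate
    return float("inf") if surplus <= 0 else backlog / surplus


# Hypothetical recovered throughput: 24 partitions x 500 msgs/sec each.
recovered_rate = 24 * 500

print(f"stalled for ~{stalled_seconds / 60:.1f} min")
print(f"drain in ~{drain_seconds(backlog_msgs, recovered_rate, produce_rate) / 60:.1f} min")
```

Under these assumptions the stall is only a few minutes old, and the drain estimate is a useful number to include in status updates once the mitigation is in place. The backlog keeps growing until the fix lands, so the drain figure should be recomputed at that point.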

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
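
One common way to stop the bleeding for a poison-pill record like this is to park the unparseable record in a dead-letter store and advance the offset, so the partition keeps moving while the root cause is investigated. The sketch below shows the idea with an in-memory stand-in rather than a real Kafka client. `Record`, `PartitionWorker`, the order fields, and the dead-letter list are all hypothetical names for illustration, and Python's `ValueError` from `json.loads` stands in for the `JsonParseException` in the scenario.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Record:
    partition: int
    offset: int
    value: bytes


@dataclass
class PartitionWorker:
    """In-memory stand-in for one partition's consume loop (not a Kafka client)."""
    committed: int = -1                               # last offset treated as done
    processed: list = field(default_factory=list)     # successfully parsed orders
    dead_letters: list = field(default_factory=list)  # parked poison pills

    def handle(self, record: Record) -> None:
        try:
            order = json.loads(record.value)
        except ValueError as exc:
            # json.JSONDecodeError and UnicodeDecodeError are both ValueErrors.
            # A parse failure is deterministic, so retrying the same bytes cannot
            # succeed: park the record with enough context to replay it later.
            self.dead_letters.append(
                {"partition": record.partition, "offset": record.offset,
                 "value": record.value, "error": repr(exc)}
            )
        else:
            self.processed.append(order)
        # Advance in both cases so one bad record cannot block the partition.
        self.committed = record.offset


worker = PartitionWorker()
batch = [
    Record(7, 100, b'{"order_id": "A1", "qty": 2}'),
    Record(7, 101, b'{"order_id": "A2", "qty": }'),   # malformed payload
    Record(7, 102, b'{"order_id": "A3", "qty": 1}'),
]
for rec in batch:
    worker.handle(rec)

print(worker.committed, len(worker.processed), len(worker.dead_letters))
```

The design choice here is to treat deterministic failures (bad payloads) differently from transient ones (a downstream timeout), which would still deserve bounded retries. Parked records keep their partition and offset so they can be replayed once the producer's schema change is fixed or rolled back.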

