On-call | Medium | oc-g162

Subject: Poison pill | Level: Mid–Senior | ~30 min | Common in: Distributed systems interviews | Industries: Technology, Software development

Question

A Kafka consumer for the `inventory-updates` topic (using a JSON/Avro deserializer) starts crash-looping at 16:20. Logs show a `DeserializationException` repeated thousands of times for the same offset on partition 3. The process catches the exception, logs it, and exits; the container's restart policy brings it back, and the restarted consumer re-reads the same offset and crashes again. Lag on partition 3 climbs while partitions 0–2 keep up. Recent context: a producer team deployed a schema change at 16:15 that emitted a few messages with a malformed payload before they rolled back. How do you triage and get the consumer unstuck, and how do you prevent the crash loop?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
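
To make the "prevent a repeat" step concrete, here is a minimal sketch of one common pattern: quarantine an undecodable record in a dead-letter store and advance past it, rather than letting it crash the process. It deliberately avoids any real Kafka client so it runs on its own. `Record`, `PartitionConsumer`, the in-memory `dead_letters` list, and the offsets 100–102 are all hypothetical stand-ins, and JSON is assumed as the payload format. In a real consumer the dead letters would more likely go to a separate topic and the commit would go through the client library.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a consumed Kafka record."""
    partition: int
    offset: int
    value: bytes


@dataclass
class PartitionConsumer:
    """Processes records in order and quarantines undecodable ones."""
    handler: Callable[[dict], None]
    dead_letters: List[dict] = field(default_factory=list)
    committed: Optional[int] = None  # next offset to read after a restart

    def process(self, record: Record) -> bool:
        try:
            payload = json.loads(record.value)
            if not isinstance(payload, dict):
                raise ValueError("payload is not a JSON object")
        except ValueError as exc:
            # Poison pill: keep raw bytes and position for inspection or replay.
            # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
            self.dead_letters.append({
                "partition": record.partition,
                "offset": record.offset,
                "raw": record.value,
                "error": str(exc),
            })
            ok = False
        else:
            # Errors raised by the handler are not caught here on purpose:
            # only deserialization failures are treated as poison pills.
            self.handler(payload)
            ok = True
        # Advance past the record either way so a restart does not re-read it.
        self.committed = record.offset + 1
        return ok


# Illustrative batch for partition 3; the middle payload is truncated.
seen: List[dict] = []
consumer = PartitionConsumer(handler=seen.append)
batch = [
    Record(3, 100, b'{"sku": "A1", "qty": 5}'),
    Record(3, 101, b'{"sku": "B2", "qty":'),
    Record(3, 102, b'{"sku": "C3", "qty": 7}'),
]
results = [consumer.process(r) for r in batch]
```

The design choice worth narrating in an interview is that the bad record is preserved, not silently dropped: the raw bytes and the partition/offset pair let the producer team diagnose the schema change and replay the message once it is fixed, while the rest of the partition keeps draining.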
