On-call | Hard | oc-g130

Subject: Poison message
Level: Senior–Staff
~40 min
Common in: Distributed systems interviews
Industries: Technology, Software development

Question

A real-time pricing consumer (Kafka, Protobuf payloads) starts crash-looping on one partition at 10:30. The dashboards show partition-7 lag climbing while partitions 0–6 stay healthy. The consumer throws `InvalidProtocolBufferException`, exits, is restarted by the orchestrator, re-reads the same offset, and crashes again every few seconds. The consumer group keeps rebalancing because the crashing member repeatedly leaves and rejoins. A producer team deployed a change that emits a NEW message type onto the SAME topic without a type discriminator, and because of keying these new messages landed only on partition-7. How do you triage, recover partition-7 throughput, and handle the incompatible messages?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
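
One way to make the "stop the bleeding" step concrete is a quarantine path: catch the decode failure, set the raw bytes aside with their partition and offset, and still advance the committed offset so a restart does not re-read the same record. The sketch below is a minimal illustration under stated assumptions, not the consumer from the incident. `Record`, `QuarantineConsumer`, and `DecodeError` are hypothetical names, an in-memory list stands in for a dead-letter topic, and JSON parsing stands in for Protobuf parsing so the example runs with the standard library only.

```python
import json
from dataclasses import dataclass, field


class DecodeError(Exception):
    """Stand-in for the deserializer failure (InvalidProtocolBufferException in the incident)."""


@dataclass
class Record:
    partition: int
    offset: int
    value: bytes


@dataclass
class QuarantineConsumer:
    committed: dict = field(default_factory=dict)     # partition -> next offset to read
    dead_letters: list = field(default_factory=list)  # stand-in for a dead-letter topic
    prices: list = field(default_factory=list)        # stand-in for downstream processing

    def decode(self, value: bytes) -> dict:
        # Assumption: JSON replaces Protobuf here purely to keep the sketch self-contained.
        try:
            msg = json.loads(value)
        except ValueError as exc:
            raise DecodeError(str(exc)) from exc
        if not isinstance(msg, dict) or "price" not in msg:
            raise DecodeError("payload is not a price message")
        return msg

    def handle(self, record: Record) -> None:
        try:
            self.prices.append(self.decode(record.value)["price"])
        except DecodeError as exc:
            # Quarantine instead of crashing: keep the raw bytes and coordinates for replay.
            self.dead_letters.append(
                (record.partition, record.offset, record.value, str(exc))
            )
        # Advance past the record in both cases so the partition keeps moving.
        self.committed[record.partition] = record.offset + 1


if __name__ == "__main__":
    consumer = QuarantineConsumer()
    partition_7 = [
        Record(7, 100, b'{"price": 1.5}'),
        Record(7, 101, b"\x08\x96\x01"),  # undecodable payload from the new producer
        Record(7, 102, b'{"price": 2.0}'),
    ]
    for rec in partition_7:
        consumer.handle(rec)
    print(consumer.prices, consumer.committed, len(consumer.dead_letters))
```

Because the quarantined bytes are kept, they can be replayed once a consumer that understands the new type is deployed. One caveat worth raising in the interview: Protobuf's wire format is not self-describing, so a foreign message can sometimes parse without throwing and yield wrong field values. Catching the exception is therefore a mitigation only, and the durable fix is a type discriminator or a separate topic.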
