On-call · Hard · oc-g343

Subject: Poison message · Level: Senior–Staff · ~35 min · Common in: Distributed systems interviews · Industries: Technology

Question

An analytics consumer (Kafka, 12 partitions, 6 instances) stops making progress on exactly one partition at 14:50. Dashboards: `records-lag` for partition 9 climbs linearly, the other 11 are at zero; the owning consumer logs a `RecordTooLargeException` then retries from the same offset forever; no crash, CPU low. Context: a producer change this morning started occasionally batching multiple events into one record, and a handful exceed the consumer's `max.partition.fetch.bytes` / `fetch.max.bytes`. How do you triage, unblock the partition, and prevent recurrence — without losing data?
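As a first triage step, the dashboard signal described above (one partition's lag climbing while the rest sit at zero) can be checked mechanically. The following is a minimal sketch, not a real monitoring integration: the `stuck_partitions` helper and the sample numbers are hypothetical, and it assumes you can already pull periodic `records-lag` samples per partition from your metrics system.

```python
from typing import Dict, List


def stuck_partitions(lag_samples: Dict[int, List[int]], min_growth: int = 1) -> List[int]:
    """Return partitions whose lag grew on every consecutive sample.

    lag_samples maps partition id -> lag readings in time order.
    A partition that is keeping up shows flat or falling lag, so it is skipped.
    """
    stuck = []
    for partition, samples in sorted(lag_samples.items()):
        if len(samples) < 2:
            continue  # not enough data to judge a trend
        deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
        if all(delta >= min_growth for delta in deltas):
            stuck.append(partition)
    return stuck


# Hypothetical readings: 12 partitions, only partition 9 is falling behind.
lag = {p: [0, 0, 0, 0] for p in range(12)}
lag[9] = [120, 240, 360, 480]

print(stuck_partitions(lag))  # [9]
```

A single stuck partition with healthy neighbours points at data in that partition (a poison record) rather than at the consumer group, the broker, or capacity.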

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
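One way to "stop the bleeding" without losing data is to bound the retries, park the coordinates of the oversized record for later replay, and advance past it. The sketch below simulates that policy in memory; it is an illustration under stated assumptions, not client code. `PartitionReader`, `RecordTooLarge`, and the record sizes are hypothetical stand-ins, and the 1 MiB limit mirrors the default value of `max.partition.fetch.bytes`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


class RecordTooLarge(Exception):
    """Hypothetical stand-in for the client's oversized-record error."""


@dataclass
class PartitionReader:
    sizes: List[int]            # record sizes in bytes, indexed by offset
    max_fetch_bytes: int        # simulated per-partition fetch limit
    position: int = 0           # next offset to read
    processed: List[int] = field(default_factory=list)
    quarantined: List[Tuple[int, int]] = field(default_factory=list)  # (offset, size)

    def fetch_one(self) -> int:
        size = self.sizes[self.position]
        if size > self.max_fetch_bytes:
            raise RecordTooLarge(f"offset {self.position}: {size} bytes")
        return self.position

    def drain(self, max_retries: int = 3) -> None:
        retries = 0
        while self.position < len(self.sizes):
            try:
                offset = self.fetch_one()
            except RecordTooLarge:
                retries += 1
                if retries < max_retries:
                    continue  # bounded retry instead of looping forever
                # Park the coordinates so the record can be replayed later,
                # then step past it so the partition makes progress again.
                self.quarantined.append((self.position, self.sizes[self.position]))
                self.position += 1
                retries = 0
                continue
            self.processed.append(offset)
            self.position += 1
            retries = 0


reader = PartitionReader(sizes=[500, 900, 2_000_000, 700], max_fetch_bytes=1_048_576)
reader.drain()
print(reader.processed)    # [0, 1, 3]
print(reader.quarantined)  # [(2, 2000000)]
```

The skipped record is not deleted: it stays in the partition until retention expires, so the quarantined offsets can be replayed by a one-off consumer with a higher fetch limit. In a real client the "step past it" move would be a seek to the next offset on the affected partition. The alternative mitigation is to raise the consumer's fetch limits and restart, which avoids skipping entirely. Either way, the lasting fix is on the producer side: cap batched record size below the smallest consumer limit.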

Learn the concepts

Diagram & narrate the incident

