Question
An analytics consumer (Kafka, 12 partitions, 6 instances) stops making progress on exactly one partition at 14:50. Dashboards: `records-lag` for partition 9 climbs linearly while the other 11 stay at zero; the owning consumer logs a `RecordTooLargeException`, then retries from the same offset forever; no crash, CPU low. Context: a producer change this morning started occasionally batching multiple events into one record, and a handful of those records exceed the consumer's `max.partition.fetch.bytes` / `fetch.max.bytes`. How do you triage, unblock the partition, and prevent recurrence, all without losing data?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
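
To make the failure mode in the scenario concrete, here is a toy model in Python of why one oversized record can pin a partition at a single offset. It is a sketch under stated assumptions, not the Kafka client: `Record`, `fetch`, `consume`, `min_limit_to_unblock`, and the byte sizes are all hypothetical, offsets are assumed to equal list indices, and the fetch budget is treated as a hard cap, as the scenario describes.

```python
from dataclasses import dataclass


class RecordTooLarge(Exception):
    """Toy stand-in for the error in the scenario; not the Kafka client class."""


@dataclass
class Record:
    offset: int
    size_bytes: int


def fetch(log, position, max_fetch_bytes):
    """Return the records starting at `position` that fit in one fetch budget."""
    batch, used = [], 0
    for record in log[position:]:
        if used + record.size_bytes > max_fetch_bytes:
            break
        batch.append(record)
        used += record.size_bytes
    if not batch and position < len(log):
        # The first record alone exceeds the budget, so no progress is possible.
        blocked = log[position]
        raise RecordTooLarge(
            f"offset {blocked.offset}: {blocked.size_bytes} bytes > {max_fetch_bytes}"
        )
    return batch


def consume(log, position, max_fetch_bytes, max_attempts=3):
    """Advance as far as possible; return (final position, consecutive failures)."""
    failures = 0
    while position < len(log) and failures < max_attempts:
        try:
            batch = fetch(log, position, max_fetch_bytes)
        except RecordTooLarge:
            # Position is unchanged, so the next attempt fails identically.
            failures += 1
            continue
        position += len(batch)
        failures = 0
    return position, failures


def min_limit_to_unblock(log, position):
    """Smallest fetch budget that lets every remaining record through."""
    return max((r.size_bytes for r in log[position:]), default=0)


# Made-up sizes: the record at offset 2 is larger than the 1000-byte budget.
log = [Record(0, 100), Record(1, 200), Record(2, 5000), Record(3, 100)]
stuck_at, failures = consume(log, 0, max_fetch_bytes=1000)
needed = min_limit_to_unblock(log, stuck_at)
resumed_at, _ = consume(log, stuck_at, max_fetch_bytes=needed)
```

In this model the consumer stalls at offset 2 and every retry fails the same way, which matches the linear lag growth with low CPU. Raising the budget to the size of the largest pending record lets it read through to the end without skipping anything, whereas seeking past the offset would unblock the partition at the cost of dropping that record.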