Question
A real-time pricing consumer (Kafka, Protobuf payloads) has been crash-looping on one partition since 10:30. Dashboards show partition-7 lag climbing while partitions 0–6 are healthy. The consumer throws `InvalidProtocolBufferException` and exits, the orchestrator restarts it, it re-reads the same offset, and it crashes again every few seconds. The consumer group keeps rebalancing because the crashing member leaves and rejoins. A producer team deployed a change that emits a NEW message type onto the SAME topic without a type discriminator, and because of keying these new messages landed only on partition-7. How do you triage, recover partition-7 throughput, and handle the incompatible messages?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
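
The crash loop in the question follows a poison-pill pattern: a record that cannot be decoded kills the consumer before the offset advances, so the same record is read again after every restart. One possible mitigation is to catch the decode failure, set the raw bytes aside, and move past the record. Below is a minimal sketch of that idea in plain Python, with no real Kafka or Protobuf client involved. `DecodeError`, `decode_price`, `drain_partition`, and the `PRICE:` payload prefix are all hypothetical stand-ins invented for illustration; in a real consumer the quarantine step would more likely publish to a dead-letter topic and the offset would be committed through the client library.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


class DecodeError(Exception):
    """Hypothetical stand-in for a failure like InvalidProtocolBufferException."""


@dataclass
class PartitionResult:
    processed: List[Tuple[int, str]] = field(default_factory=list)
    dead_letters: List[Tuple[int, bytes, str]] = field(default_factory=list)
    next_offset: int = 0  # offset the consumer would commit and resume from


def decode_price(payload: bytes) -> str:
    # Assumption for this sketch: known messages start with b"PRICE:".
    # Anything else plays the role of the new, incompatible message type.
    prefix = b"PRICE:"
    if not payload.startswith(prefix):
        raise DecodeError("payload is not a known price message")
    return payload[len(prefix):].decode("utf-8")


def drain_partition(
    records: Iterable[Tuple[int, bytes]],
    decode: Callable[[bytes], str],
    start_offset: int = 0,
) -> PartitionResult:
    result = PartitionResult(next_offset=start_offset)
    for offset, payload in records:
        if offset < start_offset:
            continue  # already handled before the restart
        try:
            result.processed.append((offset, decode(payload)))
        except DecodeError as exc:
            # Quarantine the raw bytes instead of letting the process exit,
            # so the record can be inspected or replayed later.
            result.dead_letters.append((offset, payload, str(exc)))
        # Advance past the record in both cases; this is what breaks the loop.
        result.next_offset = offset + 1
    return result
```

Called with three records at offsets 100 to 102, where the middle one is undecodable, `drain_partition` processes the two valid records, places offset 101 in `dead_letters`, and leaves `next_offset` at 103. Keeping the raw bytes matters here because the new message type carries no discriminator, so the quarantined records can only be decoded once the producer team supplies the schema.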