Question
A Kafka consumer for the `inventory-updates` topic (using a JSON/Avro deserializer) starts crash-looping at 16:20. Logs show a `DeserializationException` repeated thousands of times for the same offset on partition 3; the process catches the exception, logs it, and the container's restart policy brings it back — which re-reads the same offset and crashes again. Lag on partition 3 climbs while partitions 0–2 keep up. Recent context: a producer team deployed a schema change at 16:15 that emitted a few messages with a malformed payload before they rolled back. How do you triage and get the consumer unstuck, and how do you prevent the crash loop?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.