Question
A `user-events` Kafka consumer (Avro + Schema Registry, 8 partitions, 8 instances) starts crash-looping on partition 2 at 10:40 while the other 7 partitions stay healthy. Dashboards: partition-2 lag climbs linearly; the owning instance throws `SerializationException: Error retrieving Avro schema for id 4127` then exits, the orchestrator restarts it, it re-reads the same offset, and crash-loops. Recent context: a producer team deployed a new event version this morning that registered schema id 4127 — but they registered it in a *different* Schema Registry instance than the one the consumer is configured against. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.