On-call · Medium · oc-g393

Subject: Poison pill · Level: Mid–Senior · ~35 min · Common in Reliability & on-call interviews · Industries: Technology, Software development

Question

A `user-events` Kafka consumer (Avro + Schema Registry, 8 partitions, 8 instances) starts crash-looping on partition 2 at 10:40 while the other 7 partitions stay healthy. Dashboards show partition-2 lag climbing linearly. The owning instance throws `SerializationException: Error retrieving Avro schema for id 4127` and exits; the orchestrator restarts it, it re-reads the same offset, and it fails again. Recent context: a producer team deployed a new event version this morning and registered schema id 4127, but in a *different* Schema Registry instance than the one the consumer is configured against. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
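One way to make "mitigate first" concrete for this incident is a skip-and-dead-letter loop. The following is a minimal sketch, not the consumer's actual code: it simulates one partition in memory, and `SchemaNotFound`, `frame`, `consume_partition`, and schema id 4126 are hypothetical names and values chosen for illustration. The only external fact it relies on is the Confluent wire format, where each message starts with a zero magic byte followed by a 4-byte big-endian schema id, so the offending id can be read without contacting any registry.

```python
import struct
from dataclasses import dataclass, field

MAGIC_BYTE = 0  # Confluent wire format: 1 magic byte, then a 4-byte big-endian schema id


class SchemaNotFound(Exception):
    """Hypothetical stand-in for the SerializationException seen in the incident."""


def frame(schema_id, body):
    """Build a wire-format message (test helper, hypothetical)."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + body


def schema_id_of(payload):
    """Read the schema id from the 5-byte header without a registry lookup."""
    if len(payload) < 5:
        raise ValueError("payload shorter than the 5-byte header")
    magic, schema_id = struct.unpack(">bI", payload[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("payload is not in Confluent wire format")
    return schema_id


def deserialize(payload, known_schema_ids):
    """Fail the way the real consumer does when its registry lacks the id."""
    schema_id = schema_id_of(payload)
    if schema_id not in known_schema_ids:
        raise SchemaNotFound(f"Error retrieving Avro schema for id {schema_id}")
    return payload[5:]  # a real consumer would Avro-decode the body here


@dataclass
class PartitionResult:
    processed: list = field(default_factory=list)      # (offset, value)
    dead_lettered: list = field(default_factory=list)  # (offset, schema_id)
    committed_offset: int = 0


def consume_partition(records, known_schema_ids, start_offset=0):
    """Process records in order; park undecodable ones instead of crashing."""
    result = PartitionResult(committed_offset=start_offset)
    for offset in range(start_offset, len(records)):
        payload = records[offset]
        try:
            result.processed.append((offset, deserialize(payload, known_schema_ids)))
        except SchemaNotFound:
            # Record where the bad message sits so it can be replayed later.
            result.dead_lettered.append((offset, schema_id_of(payload)))
        # Advance past the record either way, so a restart does not re-read it.
        result.committed_offset = offset + 1
    return result


records = [frame(4126, b"a"), frame(4127, b"b"), frame(4126, b"c")]
result = consume_partition(records, known_schema_ids={4126})
print(result.dead_lettered, result.committed_offset)
```

The design choice here is that the offset advances even when decoding fails, which is what breaks the crash loop. The cost is that dead-lettered records are skipped until they are replayed, so this buys time rather than fixing the cause: the schema still has to be registered in the registry the consumer actually reads from, or the producer pointed at the correct one.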

