On-call · Hard · oc-g286

Subject: Version skew
Level: Senior–Staff · ~40 min
Common in: Distributed systems · Algorithms & data structures interviews
Industries: Technology, Software development

Question

A rolling deploy updates both the producer and the consumer of a Kafka topic. The new producer adds a required field to the event payload and bumps the schema version; the new consumer handles it. Mid-rollout, consumer lag on the topic starts climbing, a subset of partitions stalls completely, and the dead-letter count rises. Dashboards show producer pods at ~60% new / 40% old and consumer pods at ~50/50, with no HTTP errors. When an OLD consumer receives a NEW-schema message, it throws on the changed field, which is missing from its view of the schema. Because the consumer retries the same offset on failure without advancing, it gets stuck reprocessing the same poison message and blocks the partition (head-of-line blocking). NEW consumers reading old messages are fine. How do you triage and mitigate?
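The stall described above can be reproduced in a few lines. This is a minimal simulation, not real Kafka client code: the partition is a plain list, and `Message`, `old_consumer_handle`, `consume_retry_same_offset`, and the `region` field are all hypothetical names chosen for illustration. The `max_polls` cap exists only so the simulation terminates; in the scenario the retry loop has no such bound.

```python
from dataclasses import dataclass


@dataclass
class Message:
    offset: int
    schema_version: int
    payload: dict


class SchemaError(Exception):
    """Raised when a consumer cannot decode a message."""


def old_consumer_handle(msg: Message) -> None:
    # Hypothetical old consumer: it only understands schema version 1.
    if msg.schema_version != 1:
        raise SchemaError(
            f"unsupported schema v{msg.schema_version} at offset {msg.offset}"
        )


def consume_retry_same_offset(partition, handler, max_polls):
    """Retry the failing offset without advancing, as in the scenario.

    max_polls only bounds the simulation; it is not part of the scenario.
    """
    offset = 0
    failures = 0
    for _ in range(max_polls):
        if offset >= len(partition):
            break
        try:
            handler(partition[offset])
            offset += 1  # advance only after successful processing
        except SchemaError:
            failures += 1  # offset stays put, so the next poll hits the same message
    return offset, failures


# One new-schema message (offset 1) sits between two old-schema messages.
partition = [
    Message(0, 1, {"id": "a"}),
    Message(1, 2, {"id": "b", "region": "eu"}),  # "region" is an assumed field name
    Message(2, 1, {"id": "c"}),
]

committed, failures = consume_retry_same_offset(
    partition, old_consumer_handle, max_polls=10
)
lag = len(partition) - committed
print(committed, failures, lag)  # 1 9 2
```

The perfectly valid message at offset 2 is never reached: the position stays at 1 while failures accumulate, which is the head-of-line blocking that shows up as climbing lag on only the partitions owned by old consumers.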

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
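One possible way to stop the bleeding at the consumer, sketched under the same simplifications (a list stands in for the partition; `Message`, `old_consumer_handle`, and `consume_with_dead_letter` are hypothetical names, not a Kafka API): retry a bounded number of times, then park the message on a dead-letter list and advance. Pausing the rollout, or finishing the consumer rollout before the producer's, are alternative mitigations that need no code change.

```python
from dataclasses import dataclass


@dataclass
class Message:
    offset: int
    schema_version: int
    payload: dict


def old_consumer_handle(msg: Message) -> None:
    # Hypothetical old consumer: it only understands schema version 1.
    if msg.schema_version != 1:
        raise ValueError(f"unsupported schema v{msg.schema_version}")


def consume_with_dead_letter(partition, handler, max_attempts=3):
    """Bounded retries, then dead-letter the message and move on."""
    processed, dead_letters = [], []
    for msg in partition:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
            except ValueError:
                if attempt == max_attempts:
                    # Give up on this message so the partition keeps moving.
                    dead_letters.append(msg)
                continue
            processed.append(msg.offset)
            break
    return processed, dead_letters


partition = [
    Message(0, 1, {"id": "a"}),
    Message(1, 2, {"id": "b", "region": "eu"}),  # "region" is an assumed field name
    Message(2, 1, {"id": "c"}),
]

processed, dead_letters = consume_with_dead_letter(partition, old_consumer_handle)
print(processed)                          # [0, 2]
print([m.offset for m in dead_letters])   # [1]
```

The trade-off is worth stating in an answer: skipping a message gives up strict per-partition ordering for it, and the dead-lettered messages still have to be replayed once every consumer understands the new schema.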
