On-call | Hard | oc-g173

Subject: Consumer lag
Level: Senior–Staff
~40 min
Common in: Distributed systems interviews
Industries: Technology, Software development

Question

A Kafka Streams app doing stateful aggregation (RocksDB state stores, 48 partitions, 8 instances) is deployed via a rolling restart for a routine config change. During the rollout, end-to-end lag spikes to several million records and stays elevated for ~25 minutes per instance bounce. The app's `rebalance` JMX metrics fire on every pod restart, and instances spend minutes in a `RESTORING` state replaying the state stores' `-changelog` topics. The cluster is otherwise healthy. The same rolling restart runs on every deploy, so every deploy causes this lag spike. Triage why the lag spikes during deploys and explain how you'd mitigate it.
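To get a feel for the scale before triaging, the numbers in the prompt can be turned into a rough estimate. This is a back-of-envelope sketch, not a model of the real app. It assumes one stateful task per partition, an even spread of tasks across instances, and that only the bounced instance's tasks stall for the whole window. The 3 million lag figure is a hypothetical stand-in, since the prompt only says "several million".

```python
PARTITIONS = 48               # from the prompt
INSTANCES = 8                 # from the prompt
STALL_MINUTES = 25            # from the prompt: lag stays elevated ~25 min per bounce
ASSUMED_PEAK_LAG = 3_000_000  # hypothetical: the prompt only says "several million"


def tasks_per_instance(partitions: int, instances: int) -> int:
    """Stateful tasks each instance owns, assuming one task per partition
    and an even spread across instances."""
    return partitions // instances


def implied_rate_per_partition(peak_lag: int, stalled_tasks: int, stall_seconds: int) -> float:
    """Records/sec per partition that would build up `peak_lag` while
    `stalled_tasks` tasks process nothing for `stall_seconds`."""
    return peak_lag / (stalled_tasks * stall_seconds)


def total_degraded_minutes(instances: int, stall_minutes: int) -> int:
    """Elevated-lag time for one full rollout if bounces are strictly sequential."""
    return instances * stall_minutes


stalled = tasks_per_instance(PARTITIONS, INSTANCES)
rate = implied_rate_per_partition(ASSUMED_PEAK_LAG, stalled, STALL_MINUTES * 60)
print(f"tasks stalled per bounce: {stalled}")
print(f"implied input rate: {rate:.0f} records/s per partition")
print(f"degraded time per deploy: {total_degraded_minutes(INSTANCES, STALL_MINUTES)} min")
```

Under these assumptions each bounce stalls 6 tasks, and a sequential rollout of all 8 instances keeps lag elevated for over three hours per deploy. That framing helps argue the problem is restore time rather than broker or throughput capacity.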

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
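For this scenario, the "what prevents a repeat" part of an answer often comes down to a few Kafka Streams and consumer settings. The sketch below renders them as a properties fragment. The keys are standard Kafka configuration names, but the values are illustrative starting points rather than tuned recommendations, and the pod name, state directory path, and helper function names are hypothetical. It also assumes pods keep a stable identity and persistent disk across restarts (for example, a StatefulSet), which the prompt does not state.

```python
def deploy_safe_overrides(pod_name: str, state_dir: str) -> dict[str, str]:
    """Kafka Streams / consumer overrides aimed at cheaper rolling restarts.

    Values are illustrative assumptions, not tuned recommendations.
    """
    return {
        # Keep a warm copy of each state store on another instance, so a
        # task that moves does not have to replay its whole changelog.
        "num.standby.replicas": "1",
        # Static membership: a stable id per pod lets a restarted instance
        # rejoin without triggering a rebalance, provided it comes back
        # before session.timeout.ms expires.
        "group.instance.id": pod_name,
        "session.timeout.ms": "120000",  # assumed to exceed the pod restart time
        # Keep RocksDB on storage that survives the restart, so only the
        # changelog delta written while the pod was down is restored.
        "state.dir": state_dir,
    }


def to_properties(overrides: dict[str, str]) -> str:
    """Render the overrides as sorted key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(overrides.items()))


# Hypothetical pod name and mount path, for illustration only.
print(to_properties(deploy_safe_overrides("agg-app-3", "/var/lib/kafka-streams")))
```

The three ideas are independent: standby replicas shorten restores when tasks do move, static membership avoids moving them during a quick bounce, and a persistent `state.dir` limits how much of the changelog has to be replayed. Which of them applies depends on how the pods are actually scheduled and stored.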

