Question
A Kafka Streams app doing stateful aggregation (RocksDB state stores, 48 partitions, 8 instances) is deployed via a rolling restart for a routine config change. During the rollout, end-to-end lag spikes to several million records and stays elevated for ~25 minutes per instance bounce; the app's `rebalance` JMX metrics fire on every pod restart, and instances spend minutes in a `RESTORING` state replaying their `-changelog` topics. The cluster is otherwise healthy. The same rolling restart happens on every deploy, and each one causes a multi-minute lag spike. Triage why the lag spikes during deploys and how you'd mitigate it.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
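
As a starting point for the mitigation half, the sketch below collects the Kafka Streams settings most often examined when every pod bounce triggers a rebalance followed by a changelog restore. The config keys are real Kafka Streams and consumer settings; the class name, method name, pod name, timeout value, and state directory path are hypothetical and chosen only for illustration. It assumes each pod has a stable identity across restarts (for example a StatefulSet ordinal) and a persistent volume for local state. Neither is stated in the scenario.

```java
import java.util.Properties;

public class RollingRestartConfigSketch {

    // Builds overrides aimed at two symptoms from the scenario:
    // a rebalance on every restart, and long RESTORING phases.
    static Properties mitigationOverrides(String podName) {
        Properties props = new Properties();

        // Static membership: a member that rejoins with the same id before the
        // session timeout expires does not trigger a rebalance.
        // Assumption: podName is stable across restarts.
        props.setProperty("group.instance.id", podName);

        // Must exceed the time a pod takes to stop and come back.
        // 120000 ms is a placeholder, not a measured value.
        props.setProperty("session.timeout.ms", "120000");

        // Keep a warm copy of each state store on another instance so a moved
        // task does not have to replay its whole changelog.
        props.setProperty("num.standby.replicas", "1");

        // Assumption: this path sits on a persistent volume, so local RocksDB
        // state survives the restart and only the changelog tail is replayed.
        props.setProperty("state.dir", "/var/lib/kafka-streams");

        return props;
    }

    public static void main(String[] args) {
        // Print the overrides for a hypothetical pod named "streams-app-0".
        mitigationOverrides("streams-app-0")
            .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These overrides would be merged into the `Properties` passed to `KafkaStreams`. Whether they help depends on what the signals show: if the state directory is wiped on each restart, static membership alone will not shorten restoration.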