Question
A Kafka Streams app doing stateful sessionization (RocksDB state stores backed by changelog topics, 48 partitions, 6 instances) is fine for weeks. At 02:00 one instance's pod is rescheduled to a new node by Kubernetes (node maintenance). Immediately, end-to-end lag on the 8 partitions that instance owns spikes to several million and stays elevated for ~18 minutes, then recovers on its own; the other 40 partitions are unaffected the whole time. CPU on the rescheduled instance is high during the window; disk I/O is saturated. No errors, no poison records. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.