On-call · Hard · oc-g395

Subject: Consumer lag · Level: Senior–Staff · ~40 min · Common in: Distributed systems interviews · Industries: Technology

Question

A Kafka Streams app doing stateful sessionization (RocksDB state stores backed by changelog topics, 48 partitions, 6 instances) has run fine for weeks. At 02:00, Kubernetes reschedules one instance's pod to a new node for node maintenance. Immediately, end-to-end lag on the 8 partitions that instance owns spikes to several million and stays elevated for ~18 minutes, then recovers on its own; the other 40 partitions are unaffected throughout. During the window, CPU on the rescheduled instance is high and its disk I/O is saturated. There are no errors and no poison records. How do you triage and mitigate?
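
The numbers in the scenario (48 partitions across 6 instances, 8 affected) suggest a first check: is the lag confined to partitions owned by a single instance? The following is a minimal sketch, assuming per-partition lag and ownership have already been collected into plain dictionaries. The function name, the threshold, the instance names, and the sample values are all hypothetical and do not come from any Kafka API.

```python
from collections import defaultdict


def lag_by_owner(lag, owner, threshold):
    """Group partitions whose lag exceeds `threshold` by owning instance.

    lag:   {partition: lag value}       (hypothetical input shape)
    owner: {partition: instance name}   (hypothetical input shape)
    """
    lagging = defaultdict(list)
    for partition, behind in sorted(lag.items()):
        if behind > threshold:
            lagging[owner[partition]].append(partition)
    return dict(lagging)


# Illustrative data only: 48 partitions, 6 instances, 8 partitions each.
owner = {p: f"instance-{p // 8}" for p in range(48)}
lag = {p: (2_000_000 if owner[p] == "instance-3" else 150) for p in range(48)}

result = lag_by_owner(lag, owner, threshold=100_000)
print(result)
```

If every lagging partition maps to the rescheduled instance, the problem is likely local to that instance rather than upstream (producers, brokers) or in code shared by all instances.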

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
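
One hypothesis worth testing against real signals, offered here as an assumption rather than a confirmed root cause, is that the rescheduled pod started without its local RocksDB state and had to rebuild it from the changelog topics. A rough plausibility check compares the observed ~18 minutes with an estimated restore time. The sketch below is a back-of-envelope calculation only; the state size and restore throughput are hypothetical placeholders to be replaced with measured values.

```python
def restore_minutes(state_bytes, restore_bytes_per_sec):
    """Estimate minutes to replay `state_bytes` of changelog at a given rate."""
    if restore_bytes_per_sec <= 0:
        raise ValueError("restore rate must be positive")
    return state_bytes / restore_bytes_per_sec / 60


# Hypothetical inputs: 8 partitions of 5 GiB each, restored at 40 MiB/s in total.
GIB = 1024 ** 3
MIB = 1024 ** 2
estimate = restore_minutes(8 * 5 * GIB, 40 * MIB)
print(f"estimated restore time: {estimate:.1f} min")
```

If the measured state size and restore rate give an estimate close to the observed window, the hypothesis gains weight; if not, look elsewhere. Should it hold, prevention options to weigh include Kafka Streams standby replicas (`num.standby.replicas`) and keeping the state directory on storage that survives rescheduling.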
