On-callHardoc-g161

Subject Partition skewLevel Senior–Staff~35 minCommon in Distributed systems interviewsIndustries Technology, Software development

Question

The `clickstream` Kafka topic has 24 partitions, keyed by `customer_id`, consumed by a 24-instance group (one partition each). Alert: end-to-end lag on the topic is 4M and growing, but only on *one* partition — partition 7's lag is 3.9M while the other 23 are near zero. The consumer instance owning partition 7 is at 95% CPU; the rest are at ~10%. Produce rate is normal in aggregate. Recent context: a single large enterprise customer (`customer_id` 90021) onboarded last week and now generates ~40% of all clickstream volume. Triage and mitigate this partition skew.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.