On-callHardoc-g394

Subject Partition skewLevel Senior–Staff~40 minCommon in Distributed systems interviewsIndustries Technology

Question

A `metrics-rollup` Kafka consumer group reads a topic with 60 partitions using 20 instances. Traffic is even across all 60 partitions (verified: per-partition produce rates are within 5% of each other). At 14:00 lag starts building unevenly: a few instances are at 95% CPU with high lag on their assigned partitions, while several instances sit near-idle with low lag. Dashboards: the partition-to-instance assignment is lopsided — some instances own 5–6 partitions, others own 1. No rebalances are currently firing. Recent context: the deployment scaled from 15 to 20 instances via a rolling restart 40 minutes ago, and the group uses the default `RangeAssignor`. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.