Question
A `metrics-rollup` Kafka consumer group reads a topic with 60 partitions using 20 instances. Traffic is even across all 60 partitions (verified: per-partition produce rates are within 5% of each other). At 14:00 lag starts building unevenly: a few instances are at 95% CPU with high lag on their assigned partitions, while several instances sit near-idle with low lag. Dashboards: the partition-to-instance assignment is lopsided — some instances own 5–6 partitions, others own 1. No rebalances are currently firing. Recent context: the deployment scaled from 15 to 20 instances via a rolling restart 40 minutes ago, and the group uses the default `RangeAssignor`. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.