Code Room
On-callHard
Question
The `clickstream` Kafka topic has 24 partitions, keyed by `customer_id`, consumed by a 24-instance group (one partition each). Alert: end-to-end lag on the topic is 4M and growing, but only on *one* partition — partition 7's lag is 3.9M while the other 23 are near zero. The consumer instance owning partition 7 is at 95% CPU; the rest are at ~10%. Produce rate is normal in aggregate. Recent context: a single large enterprise customer (`customer_id` 90021) onboarded last week and now generates ~40% of all clickstream volume. Triage and mitigate this partition skew.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.