On-callHardoc-g386

Subject Partition skewLevel Senior–Staff~40 minCommon in Distributed systems interviewsIndustries Technology

Question

An events pipeline writes to a Kafka topic `audit-log` (16 partitions) keyed by `org_id`, consumed by a 16-instance group. At 11:00 lag alerts fire but ONLY on partition 4: its lag is 6M and climbing while the other 15 partitions are at ~0. The instance owning partition 4 is pegged at 100% CPU; the rest idle. No errors, no poison message — the records on partition 4 are all valid and from a single `org_id`. Recent context: a large enterprise customer (`org_id=ACME`) turned on verbose audit logging this morning, and they generate ~70% of all audit events. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.