Question
An events pipeline writes to a Kafka topic `audit-log` (16 partitions) keyed by `org_id`, consumed by a 16-instance group. At 11:00 lag alerts fire but ONLY on partition 4: its lag is 6M and climbing while the other 15 partitions are at ~0. The instance owning partition 4 is pegged at 100% CPU; the rest idle. No errors, no poison message — the records on partition 4 are all valid and from a single `org_id`. Recent context: a large enterprise customer (`org_id=ACME`) turned on verbose audit logging this morning, and they generate ~70% of all audit events. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
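An added example, not part of the original question: a small simulation of the skew described above, showing that with keying by `org_id` a tenant producing ~70% of events pins at least 70% of traffic to one partition, and that this share does not fall when the partition count grows. CRC32 stands in for the producer's real key hash, and the event count and other org names are constructed, so the hot partition index printed here will not match the scenario's partition 4.

```python
import zlib
from collections import Counter

ACME_PERCENT = 70        # from the question: ACME generates ~70% of audit events
TOTAL_EVENTS = 100_000   # constructed sample size
OTHER_ORGS = 500         # constructed: number of other tenants sharing the remainder


def partition_for(key: str, num_partitions: int) -> int:
    """Keyed partitioning: hash(key) mod partition count.

    CRC32 is an assumption standing in for the producer's real hash. The
    property that matters is the same: one key maps to exactly one partition.
    """
    return zlib.crc32(key.encode()) % num_partitions


def build_events() -> list:
    """Constructed, deterministic key stream: 70% ACME, rest spread over other orgs."""
    acme = TOTAL_EVENTS * ACME_PERCENT // 100
    others = [f"org-{i % OTHER_ORGS}" for i in range(TOTAL_EVENTS - acme)]
    return ["ACME"] * acme + others


def skew_report(events: list, num_partitions: int) -> dict:
    """Share of traffic on the hottest partition, compared with an even spread."""
    per_partition = Counter(partition_for(k, num_partitions) for k in events)
    hot, hot_count = per_partition.most_common(1)[0]
    hot_share = hot_count / len(events)
    return {
        "partitions": num_partitions,
        "hot_partition": hot,
        "hot_share": hot_share,
        # How many times the fair share (1/N) the owning consumer must absorb.
        "load_vs_fair": hot_share * num_partitions,
    }


if __name__ == "__main__":
    events = build_events()
    # 16 is the topic's partition count; 32 and 64 model "just add partitions".
    for n in (16, 32, 64):
        r = skew_report(events, n)
        print(f"{n:>2} partitions: hottest={r['hot_partition']:>2} "
              f"share={r['hot_share']:.1%} load={r['load_vs_fair']:.1f}x fair")
```

The hottest partition's share stays at or above 70% for every partition count, so the consumer that owns it remains the bottleneck while the others idle.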