Question
At 14:20 PagerDuty fires: the `order-enrichment` Kafka consumer group (12 partitions, 12 instances) is lagging. Dashboards: end-to-end lag climbed from ~3k to 900k over 25 minutes on every partition roughly evenly. But unlike a normal lag event, the broker-side produce rate on `orders` is *flat* at its usual ~4k msg/s — no traffic spike. The consumers are at ~8% CPU and almost idle. Each consumer does one synchronous HTTP call per record to an internal `customer-profile` service; that service's p99 latency went from 20ms to 1.8s at 14:15 (its own dashboard shows a slow dependency on a Redis cluster doing a failover). `max.poll.interval.ms` is the default 5 min and no rebalances are firing yet. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.