Question
PagerDuty fires at 02:14: the `orders-enrichment` Kafka consumer group is alerting on consumer lag. Dashboards: end-to-end lag on the `orders` topic (12 partitions) has climbed from a steady ~2k to 1.8M records over 40 minutes and is still rising linearly. Broker-side produce rate is flat at its usual ~9k msg/s. Consumer group has 4 instances; CPU on them is ~30%, and the consumer's `records-consumed-rate` JMX metric dropped to near zero ~35 minutes ago. The group's `rebalance-rate` shows a spike right before the drop. A deploy of the enrichment service went out 45 minutes ago. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.