Question
Your order-fulfillment service consumes from a Kafka topic `orders.placed` (24 partitions). At 14:20 the consumer-group lag dashboard shows lag climbing linearly across ALL partitions — it's now 1.2M messages and growing ~3k/sec, and customers are reporting orders stuck in 'pending'. The consumers are healthy (not crashed, CPU low), but the processing-rate metric for the group has dropped to nearly zero. Logs show the same record being processed over and over: each attempt throws `JsonParseException` on a record at a fixed offset on partition 7, the consumer doesn't commit, rebalances, and retries from the same offset. A producer team deployed a schema change 30 minutes before this started. How do you triage, stop the bleeding, and prevent recurrence?
Stop the bleeding first (mitigate), then form hypotheses from the signals you actually have: the logs, the lag graph, and the deploy timeline. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
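One common way to keep a single unparseable record from stalling a partition is to route it to a dead-letter store and advance past it, rather than retrying forever. The following is a minimal sketch of that idea, not the service's actual code: `Record`, `ConsumeResult`, and `consume_batch` are hypothetical names, the in-memory lists stand in for a real dead-letter topic and a real offset commit, and no Kafka client library is involved.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple


@dataclass
class Record:
    """Hypothetical stand-in for a consumed Kafka record."""
    partition: int
    offset: int
    value: bytes


@dataclass
class ConsumeResult:
    # partition -> next offset to read (what would be committed)
    committed: Dict[int, int] = field(default_factory=dict)
    # (partition, offset, error message) for each record that failed to parse
    dead_letters: List[Tuple[int, int, str]] = field(default_factory=list)


def consume_batch(
    records: Iterable[Record],
    handle: Callable[[Any], None],
    result: Optional[ConsumeResult] = None,
) -> ConsumeResult:
    """Process records in order, dead-lettering any that fail to parse.

    A parse failure is deterministic: retrying the same bytes cannot
    succeed, so the record is set aside and the offset still advances.
    Exceptions raised by `handle` are NOT caught here, so a failure in
    business logic leaves that record's offset unrecorded and it can be
    retried.
    """
    if result is None:
        result = ConsumeResult()
    for rec in records:
        try:
            order = json.loads(rec.value)
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            # Poison pill: record where it came from and why it failed.
            result.dead_letters.append((rec.partition, rec.offset, str(exc)))
        else:
            handle(order)
        # Advance past this record whether it was handled or dead-lettered.
        result.committed[rec.partition] = rec.offset + 1
    return result


if __name__ == "__main__":
    batch = [
        Record(7, 100, b'{"order_id": 1}'),
        Record(7, 101, b'{"order_id": '),  # truncated JSON, cannot be parsed
        Record(7, 102, b'{"order_id": 3}'),
    ]
    handled: List[Any] = []
    outcome = consume_batch(batch, handled.append)
    print(handled)
    print(outcome.committed)
    print(outcome.dead_letters)
```

The design choice worth noting is the split between deterministic failures (bad bytes, which are set aside) and everything else (which still blocks the commit). Whether skipping is acceptable for order data is a business decision; the dead-letter entries keep the partition and offset so the records can be replayed once the schema issue is resolved.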