Question
An `invoicing` Kafka consumer (16 partitions, 8 instances, keyed by `tenant_id`) stops making progress for *one tenant's* invoices at 13:05 while every other tenant is fine. Dashboards: `records-lag` for partition 11 climbs linearly; partitions 0–10 and 12–15 are flat at ~0. The instance owning partition 11 logs the same business-rule exception — `IllegalStateException: invoice already finalized` — on the same offset, thousands of times; the handler catches it, logs, does NOT advance the offset, and retries the *same* record on the next poll. CPU on that one instance is moderate. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.