Question
A `clickstream` Kafka consumer group fell behind during a long outage of its downstream warehouse: the consumer was effectively paused for ~14 hours while the warehouse was down. The `clickstream` topic is configured with `retention.ms` set to 12 hours (`cleanup.policy=delete`). When the consumer resumes at 08:00, it starts processing again with no errors, and the lag alerts do not clear the way you'd expect. Reconciliation later finds a ~2-hour window of events that were NEVER delivered to the warehouse, even though the consumer 'caught up.' Dashboards show that during resume the consumer logged `OffsetOutOfRangeException` on several partitions, and the consumer is configured with `auto.offset.reset=latest`. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
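
As a starting point for hypotheses, the numbers in the question can be sanity-checked with a small calculation. The sketch below is illustrative only: the function names and the calendar date are hypothetical (only the 08:00 resume time, the ~14-hour pause, and the 12-hour retention come from the question), and it treats retention as an exact time cutoff, whereas Kafka deletes whole log segments on a periodic check, so the real boundary is fuzzier.

```python
from datetime import datetime, timedelta


def expired_window(resume_at, paused_for, retention):
    """Return (start, end) of the span that aged out of the topic before the
    consumer resumed, or None if retention covered the whole pause."""
    paused_at = resume_at - paused_for        # last point the consumer had read up to
    oldest_retained = resume_at - retention   # approximate retention cutoff at resume
    if oldest_retained <= paused_at:
        return None
    return paused_at, oldest_retained


def partitions_out_of_range(committed, log_start):
    """Partitions whose committed offset is older than the earliest offset
    still on the broker, i.e. where the reset policy would apply."""
    return sorted(p for p, off in committed.items() if off < log_start.get(p, 0))


# Hypothetical date; only the 08:00 resume time is taken from the question.
resume_at = datetime(2024, 1, 2, 8, 0)
window = expired_window(resume_at, timedelta(hours=14), timedelta(hours=12))
print(window)

# Hypothetical offsets: partition 0 was committed at 100 but the log now starts at 250.
print(partitions_out_of_range({0: 100, 1: 500}, {0: 250, 1: 400}))
```

Under these assumptions the expired span runs from 18:00 to 20:00 on the previous day, which matches the ~2-hour gap reconciliation found. Treat that as a lower bound when scoping the loss: with `auto.offset.reset=latest`, a partition that hits an out-of-range offset resumes from the end of the log, so records still retained on that partition may also have been skipped.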