Question
A KCL consumer processes a Kinesis stream and writes to a downstream warehouse. After a worker fleet incident this morning (several KCL workers were OOM-killed and restarted repeatedly for ~20 minutes), reconciliation finds a gap: a window of records that never reached the warehouse. The Kinesis stream's retention is 24h and the data is still within retention. Dashboards: during the incident, `IteratorAgeMilliseconds` briefly spiked and then recovered; the DynamoDB lease table shows checkpoints that advanced during the crash window. The consumer code checkpoints *once per batch immediately on receiving records*, before the warehouse write. Triage the incident, explain the gap, and describe how you would recover the lost data.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
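
To make the checkpoint ordering in the scenario concrete, here is a minimal, self-contained simulation rather than real KCL code. It assumes a single shard, a single injected crash that lands exactly at the warehouse write, and an in-memory stand-in for both the lease table and the warehouse. All names (`Shard`, `Warehouse`, `WorkerCrash`, `run`) are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


class WorkerCrash(Exception):
    """Stands in for an OOM kill that lands at the warehouse write."""


@dataclass
class Shard:
    records: list          # record payloads, in stream order
    checkpoint: int = -1   # index of the last checkpointed record (lease-table stand-in)


@dataclass
class Warehouse:
    rows: list = field(default_factory=list)

    def write(self, batch, fail=False):
        # A failing write persists nothing, like a worker killed mid-flight.
        if fail:
            raise WorkerCrash()
        self.rows.extend(batch)


def run(shard, warehouse, batch_size, crash_on_batch, checkpoint_first):
    """Drain the shard from its checkpoint, crashing once on the given batch.

    After the crash, the "restarted worker" resumes from whatever
    checkpoint was stored at the time of the crash.
    """
    crashed = False
    while shard.checkpoint + 1 < len(shard.records):
        start = shard.checkpoint + 1
        batch = shard.records[start:start + batch_size]
        fail = (start // batch_size == crash_on_batch) and not crashed
        new_checkpoint = start + len(batch) - 1
        try:
            if checkpoint_first:
                # Ordering from the scenario: checkpoint on receipt, then write.
                shard.checkpoint = new_checkpoint
                warehouse.write(batch, fail)
            else:
                # Alternative ordering: write durably, then checkpoint.
                warehouse.write(batch, fail)
                shard.checkpoint = new_checkpoint
        except WorkerCrash:
            crashed = True  # loop continues as the restarted worker
    return warehouse.rows
```

With nine records in batches of three and a crash on the second batch, `run(Shard(list(range(9))), Warehouse(), 3, 1, checkpoint_first=True)` leaves the warehouse without records 3 to 5, while the checkpoint has already moved past them. The same call with `checkpoint_first=False` delivers all nine. Note that write-then-checkpoint is at-least-once: a crash after the write but before the checkpoint replays the batch, so the warehouse write would need to be idempotent.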