On-call | Hard | oc-g175

Subject: Message loss | Level: Senior–Staff | ~40 min | Common in: Reliability & on-call interviews | Industries: Technology, Software development

Question

A KCL consumer processes a Kinesis stream and writes to a downstream warehouse. After a worker fleet incident this morning (several KCL workers OOM-killed and restarted repeatedly for ~20 minutes), reconciliation finds a gap: a window of records that never reached the warehouse. The stream's retention is 24h, and the missing data is still within retention. Dashboards show that during the incident `IteratorAgeMilliseconds` briefly spiked and then recovered, and the DynamoDB lease table shows checkpoints that advanced during the crash window. The consumer code checkpoints *once per batch immediately on receiving records*, before the warehouse write. Triage the incident, explain the gap, and describe how you would recover the lost data.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
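
For the root-cause and recovery parts of this scenario, the checkpoint ordering can be illustrated with a small in-memory simulation. This is a minimal sketch, not KCL code: `run_worker`, `simulate`, `missing`, and `replay` are hypothetical helpers, the list `stream` stands in for one shard, the `lease` dict stands in for the DynamoDB lease table, and the `warehouse` dict stands in for a sink that upserts by sequence number. It assumes a worker is killed after checkpointing but before the warehouse write, which is the ordering the question describes.

```python
class Crash(Exception):
    """Stands in for a worker being OOM-killed mid-batch."""


def run_worker(stream, lease, warehouse, checkpoint_first, crash_at=None, batch_size=2):
    """Consume from the stored checkpoint; optionally die before one batch's write."""
    pos = lease["checkpoint"]
    while pos < len(stream):
        batch = stream[pos:pos + batch_size]
        next_pos = pos + len(batch)
        if checkpoint_first:
            lease["checkpoint"] = next_pos   # ordering described in the question
        if pos == crash_at:
            raise Crash()                    # killed before the warehouse write
        for seq, payload in batch:
            warehouse[seq] = payload         # idempotent upsert keyed by sequence number
        if not checkpoint_first:
            lease["checkpoint"] = next_pos   # checkpoint only after the write succeeded
        pos = next_pos


def simulate(stream, checkpoint_first, crash_at=4):
    """Run a worker that crashes once, then a restarted worker; return the warehouse."""
    lease, warehouse = {"checkpoint": 0}, {}
    try:
        run_worker(stream, lease, warehouse, checkpoint_first, crash_at=crash_at)
    except Crash:
        pass
    # The restarted worker resumes from whatever checkpoint the lease holds.
    run_worker(stream, lease, warehouse, checkpoint_first)
    return warehouse


def missing(stream, warehouse):
    """Reconciliation: sequence numbers present in the stream but not the warehouse."""
    return [seq for seq, _ in stream if seq not in warehouse]


def replay(stream, warehouse, first_seq, last_seq):
    """Re-read a retained range and upsert it; safe to run more than once."""
    for seq, payload in stream:
        if first_seq <= seq <= last_seq:
            warehouse[seq] = payload


stream = [(seq, f"event-{seq}") for seq in range(10)]
lossy = simulate(stream, checkpoint_first=True)
safe = simulate(stream, checkpoint_first=False)
gap = missing(stream, lossy)
print("checkpoint-before-write gap:", gap)
print("checkpoint-after-write gap:", missing(stream, safe))
replay(stream, lossy, gap[0], gap[-1])
print("gap after replay:", missing(stream, lossy))
```

In the sketch, checkpointing first leaves a gap because the restarted worker trusts a checkpoint that is ahead of what was written, while checkpointing after the write gives at-least-once delivery, which can produce duplicates and is why the sink upserts by sequence number. Against a real stream, one plausible recovery is a separate backfill reader that starts each shard from a timestamp just before the incident (Kinesis supports an `AT_TIMESTAMP` shard iterator) and writes idempotently, which works only while the data is still within retention.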

