Code Room
On-call · Medium
Question
An events-ingestion consumer reading from an ordered partition (Kinesis shard) stops making progress at 11:40. Dashboards: `IteratorAge` for one shard climbs linearly to 45 minutes while the other shards are healthy; the consumer logs a tight loop of the same `JsonParseException: Unexpected character ('\x00')` every few hundred ms; CPU on that worker is at 100%; no records are being checkpointed for that shard. A new upstream producer SDK version rolled out an hour ago. How do you triage, unblock the pipeline, and avoid losing the records behind the bad one?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
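To make the mitigation step concrete, here is a minimal Python sketch of quarantining a poison record so the shard can advance without dropping the records behind it. It is an illustration under stated assumptions, not the consumer's actual code or the Kinesis Client Library API: `Record`, `ShardProcessor`, `handle`, and `dead_letters` are hypothetical names, and the in-memory list stands in for a durable dead-letter store. The idea is that a parse failure is deterministic, so it is never retried; the raw bytes are set aside with their sequence number, and the checkpoint moves past the record.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Record:
    """Hypothetical stand-in for one record read from a shard."""
    sequence_number: str
    data: bytes


@dataclass
class ShardProcessor:
    """Processes records in order; quarantines ones that cannot succeed."""
    handle: Callable[[dict], None]
    max_attempts: int = 3
    # Assumption: in production this would be a durable dead-letter store.
    dead_letters: List[Tuple[str, bytes, str]] = field(default_factory=list)
    checkpoint: Optional[str] = None

    def process_batch(self, records: List[Record]) -> None:
        for record in records:
            self._process_one(record)
            # Advance after every record, handled or quarantined, so a
            # single bad record cannot pin the shard in place.
            self.checkpoint = record.sequence_number

    def _process_one(self, record: Record) -> None:
        try:
            event = json.loads(record.data.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            # Malformed payloads fail the same way every time: skip retries.
            self._quarantine(record, exc)
            return

        last_error: Optional[Exception] = None
        for _ in range(self.max_attempts):
            try:
                self.handle(event)
                return
            except Exception as exc:  # possibly transient handler failure
                last_error = exc
        self._quarantine(record, last_error)

    def _quarantine(self, record: Record, error: Optional[Exception]) -> None:
        # Keep the raw bytes and sequence number so the record can be
        # inspected and replayed once the producer bug is understood.
        self.dead_letters.append(
            (record.sequence_number, record.data, repr(error))
        )
```

Whether it is acceptable to skip ahead depends on the ordering contract: if downstream state for a key requires every event in sequence, the quarantined record may need to be repaired and replayed before later events for that key are applied. A real consumer would also want backoff between handler retries and an alert on dead-letter volume, since a sudden spike after a producer rollout is itself a signal.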