On-call | Medium | oc-g116

Subject: Poison message | Level: Mid–Senior | ~30 min | Common in: Distributed systems interviews | Industries: Technology

Question

An events-ingestion consumer reading from an ordered partition (Kinesis shard) stops making progress at 11:40. Dashboards: `IteratorAge` for one shard climbs linearly to 45 minutes while the other shards are healthy; the consumer logs a tight loop of the same `JsonParseException: Unexpected character ('\x00')` every few hundred ms; CPU on that worker is at 100%; no records are being checkpointed for that shard. A new upstream producer SDK version rolled out an hour ago. How do you triage, unblock the pipeline, and avoid losing the records behind the bad one?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
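For this scenario, the mitigation step can be made concrete with a small sketch. It assumes a consumer that processes one shard's records in order, treats a parse failure as deterministic (so retrying the same bytes will not help), parks the raw record in a dead-letter store, and then advances the checkpoint so the records behind it keep flowing. `Record`, `ShardState`, and `process_batch` are hypothetical names for illustration, not part of any Kinesis client library, and the in-memory list stands in for a durable dead-letter sink such as a queue or object store.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Record:
    # Hypothetical stand-in for a record read from one ordered shard.
    sequence_number: str
    data: bytes


@dataclass
class ShardState:
    # Last sequence number the consumer has moved past.
    checkpoint: Optional[str] = None
    # Assumption: an in-memory list replaces a durable dead-letter sink.
    dead_letters: List[Tuple[str, bytes, str]] = field(default_factory=list)


def process_batch(
    records: List[Record],
    handle: Callable[[Any], None],
    state: ShardState,
) -> None:
    """Process records in order, quarantining any that cannot be parsed."""
    for record in records:
        try:
            payload = json.loads(record.data.decode("utf-8"))
        except ValueError as exc:
            # JSONDecodeError and UnicodeDecodeError both subclass ValueError.
            # Keep the raw bytes and sequence number so the record can be
            # inspected and replayed once the producer bug is understood.
            state.dead_letters.append(
                (record.sequence_number, record.data, repr(exc))
            )
        else:
            handle(payload)
        # Advance past the record either way so one bad record cannot
        # block everything behind it on the shard.
        state.checkpoint = record.sequence_number
```

The design choice to discuss in an answer is the trade-off: skipping preserves throughput and ordering for the healthy records, while the quarantined bytes keep the bad record recoverable rather than lost.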
