On-call · Medium · oc-g170

Subject: Partition skew · Level: Mid–Senior · ~35 min · Common in: Distributed systems interviews · Industries: Technology, Software development

Question

An IoT ingestion pipeline writes device telemetry to a Kinesis stream (8 shards), using `region_code` as the partition key. At 12:00, producers start getting `ProvisionedThroughputExceededException` (write throttling) on *some* records, even though aggregate stream `IncomingBytes` is well under the 8-shard write budget (8 MB/s, at 1 MB/s per shard). The CloudWatch shard-level `WriteProvisionedThroughputExceeded` metric is high on exactly 2 of the 8 shards; the other 6 are nearly idle. Recent context: most traffic comes from two `region_code` values (`us-east`, `eu-west`), which together cover ~80% of devices. Triage and mitigate the write throttling despite the spare aggregate capacity.
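To make the symptom concrete, here is a minimal sketch of why two dominant keys can saturate two shards while the rest sit idle. Kinesis hashes each partition key with MD5 into a 128-bit hash-key space and routes the record to the shard whose range contains that value. The sketch assumes the 8 shards split the space evenly (a resharded stream may not; `ListShards` shows the real ranges), and the traffic mix and the extra region names are hypothetical.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8
HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit hash key


def shard_for(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Return a shard index, ASSUMING evenly split hash-key ranges."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hash_key * num_shards // HASH_SPACE


# Hypothetical mix: ~80% of records from two regions, the rest from two others.
traffic = ["us-east"] * 40 + ["eu-west"] * 40 + ["ap-south"] * 10 + ["sa-east"] * 10

# Records per shard index. With only four distinct keys, at most four shards
# can ever receive data, and the two hot keys pin ~80% of it to one or two.
load = Counter(shard_for(key) for key in traffic)

hot_shards = {shard_for("us-east"), shard_for("eu-west")}
hot_records = sum(load[s] for s in hot_shards)
print(f"shards in use: {sorted(load)}; records on hot shards: {hot_records}/{len(traffic)}")
```

The point for triage: a given partition key always maps to the same shard, so the per-shard limit, not the stream total, is what the hot keys run into.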

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
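One common mitigation for this kind of skew is a higher-cardinality partition key. The sketch below compares the original `region_code` key with a composite `region_code#device_id` key. It is an illustration under assumptions, not the pipeline's actual code: `composite_key`, the device IDs, and the region mix are hypothetical, and the shard mapping again assumes evenly split hash-key ranges.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8
HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit hash key


def shard_for(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Return a shard index, ASSUMING evenly split hash-key ranges."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hash_key * num_shards // HASH_SPACE


def composite_key(region_code: str, device_id: str) -> str:
    """Hypothetical helper: widen the key so each device hashes independently."""
    return f"{region_code}#{device_id}"


# Hypothetical fleet: 80% of devices in two regions, one record per device.
devices = (
    [("us-east", f"dev-{i}") for i in range(0, 4000)]
    + [("eu-west", f"dev-{i}") for i in range(4000, 8000)]
    + [("ap-south", f"dev-{i}") for i in range(8000, 9000)]
    + [("sa-east", f"dev-{i}") for i in range(9000, 10000)]
)

by_region = Counter(shard_for(region) for region, _ in devices)
by_device = Counter(shard_for(composite_key(region, dev)) for region, dev in devices)

print("busiest shard, region_code key:", max(by_region.values()))
print("busiest shard, composite key:  ", max(by_device.values()))
```

The trade-off to state out loud in an answer: the composite key keeps per-device ordering, because a device still always maps to one shard, but records for a region are no longer co-located, so any consumer that relied on per-region grouping has to regroup downstream. Adding shards without changing the key would not relieve a single hot key, since one key cannot be spread across shards.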

Learn the concepts

Diagram & narrate the incident


Run or narrate your approach, then ask the coach.