Question
A trade-position service consumes a Kinesis stream keyed by `account_id`. Support reports that a handful of accounts showed a position that briefly went *negative* and then corrected: an out-of-order apply of `debit`/`credit` events that are supposed to be strictly ordered per account. Dashboards show low `IteratorAgeMilliseconds`, so there is no consumer lag. Recent context: ops resharded the stream from 8 to 16 shards (shard splits) at 14:00 to handle growth, and the KCL consumer fleet scaled from 8 to 16 workers around the same time. The affected accounts are all ones whose `account_id` hash landed near a split boundary. Triage and explain the ordering violation.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
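
For background on the "hash landed near a split boundary" detail, here is a minimal sketch of how a partition key maps to a shard before and after the reshard. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash-key range contains it. The evenly divided ranges, the helper names (`hash_key`, `even_ranges`, `shard_index`), and the sample `account_id` are assumptions for illustration; a real stream's ranges should be read from its shard listing.

```python
import hashlib

HASH_SPACE = 2 ** 128  # Kinesis hash keys are 128-bit unsigned integers


def hash_key(partition_key: str) -> int:
    """MD5 of the partition key, read as a 128-bit unsigned integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_ranges(shard_count: int) -> list[tuple[int, int]]:
    """Inclusive (start, end) hash-key ranges, ASSUMING an even division."""
    width = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * width
        end = HASH_SPACE - 1 if i == shard_count - 1 else start + width - 1
        ranges.append((start, end))
    return ranges


def shard_index(partition_key: str, ranges: list[tuple[int, int]]) -> int:
    """Index of the range that owns this partition key."""
    h = hash_key(partition_key)
    for i, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return i
    raise ValueError("hash key not covered by any range")


if __name__ == "__main__":
    before = even_ranges(8)   # hypothetical layout before 14:00
    after = even_ranges(16)   # hypothetical layout after the splits
    account = "acct-1042"     # hypothetical account_id
    print("parent range:", shard_index(account, before))
    print("child range: ", shard_index(account, after))
```

Under these assumptions, an account's records written before the reshard sit in a parent shard and records written afterwards sit in one of its two child shards, so per-account order then depends on how the consumer sequences the parent relative to the child.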