Question
A state-machine service consumes a Kafka topic where per-key event order matters (`account_id` key → `created`, `updated`, `closed`). Support finds a few accounts where a `closed` event was applied *before* an `updated` event for the same key, leaving stale state. The consumers are healthy and process each partition in order. Investigation points at the *producer*: it's configured with `acks=all`, `retries=10`, `max.in.flight.requests.per.connection=5`, and `enable.idempotence=false`. Dashboards show a brief network blip to the broker around the affected timestamps, which caused some produce retries. Triage and explain how events got out of order on the same partition.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
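To make the suspected mechanism concrete, here is a minimal sketch of how a retried produce request can land behind later requests that were already in flight. It is a toy model, not the Kafka client: the function `simulate_partition_log` and its parameters are hypothetical, and it assumes each event travels in its own produce request on a single connection to one partition, with no idempotent sequencing.

```python
from collections import deque


def simulate_partition_log(events, max_in_flight, fail_first_attempt):
    """Toy model of one producer connection writing to one partition.

    events: events in send order (assumed: one event per produce request).
    max_in_flight: how many unacknowledged requests may be outstanding.
    fail_first_attempt: events whose first attempt is lost (e.g. to a
        network blip) and is then retried once.
    Returns the order in which the broker appended the events.
    """
    pending = deque(events)   # not yet sent
    in_flight = deque()       # sent, waiting for a broker response
    already_failed = set()
    log = []

    while pending or in_flight:
        # Fill the pipeline up to the in-flight limit.
        while pending and len(in_flight) < max_in_flight:
            in_flight.append(pending.popleft())

        # The broker handles the oldest in-flight request.
        event = in_flight.popleft()
        if event in fail_first_attempt and event not in already_failed:
            already_failed.add(event)
            # The retry goes back to the head of the send queue, but any
            # request already in flight is still ahead of it.
            pending.appendleft(event)
        else:
            log.append(event)
    return log


events = ["created", "updated", "closed"]

# Five requests in flight: the retried "updated" lands after "closed".
print(simulate_partition_log(events, 5, {"updated"}))
# ['created', 'closed', 'updated']

# One request in flight: the retry completes before "closed" is sent.
print(simulate_partition_log(events, 1, {"updated"}))
# ['created', 'updated', 'closed']
```

In this model, order survives the retry only when a single request is outstanding at a time. In Kafka itself, the documented alternative is `enable.idempotence=true`, which preserves per-partition order across retries with up to 5 in-flight requests per connection.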