On-call · Hard · oc-g169

Subject: Ordering violation · Level: Senior–Staff · ~40 min · Common in: Networking & APIs · Distributed systems interviews · Industries: Technology, Software development

Question

A state-machine service consumes a Kafka topic where per-key event order matters (events are keyed by `account_id`: `created`, `updated`, `closed`). Support finds a few accounts where a `closed` event was applied *before* an `updated` event for the same key, leaving stale state. The consumers are healthy and apply events in the order they appear in the partition. Investigation points at the *producer*: it is configured with `acks=all`, `retries=10`, `max.in.flight.requests.per.connection=5`, and `enable.idempotence=false`. Dashboards show a brief network blip to the broker around the affected timestamps, which caused some produce retries. Triage the incident and explain how events got out of order on the same partition.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
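For the root-cause part, the mechanism can be illustrated with a toy model. The sketch below is not Kafka client code: `simulate_partition_log` and its parameters are hypothetical names, and it simplifies by treating each event as its own produce request on a single connection to a single partition. It shows how a request that fails and is retried can land behind a later request that was already in flight, and how limiting in-flight requests to 1 or having the broker enforce sequence numbers (which is, roughly, what `enable.idempotence=true` provides) keeps the order.

```python
from collections import deque


def simulate_partition_log(events, max_in_flight, fail_first_attempt, idempotent=False):
    """Toy model of one producer connection writing to one partition.

    events: payloads in the order the application sent them.
    max_in_flight: how many requests may be outstanding at once (must be >= 1).
    fail_first_attempt: payloads whose first send is lost (the network blip).
    idempotent: if True, the broker only accepts the next expected sequence
        number, so later requests are rejected and retried in order.
    Returns the payloads in the order they end up in the partition log.
    """
    pending = deque(enumerate(events))  # (sequence number, payload)
    already_failed = set()
    log = []
    next_expected = 0

    while pending:
        # Send up to max_in_flight requests without waiting for responses.
        window = [pending.popleft() for _ in range(min(max_in_flight, len(pending)))]
        retry = []
        for seq, payload in window:
            if payload in fail_first_attempt and seq not in already_failed:
                # First attempt is lost; the producer will retry it.
                already_failed.add(seq)
                retry.append((seq, payload))
            elif idempotent and seq != next_expected:
                # Broker rejects an out-of-sequence request; producer retries it.
                retry.append((seq, payload))
            else:
                log.append(payload)
                next_expected = seq + 1
        # Retried requests go back to the front of the queue, in send order.
        pending.extendleft(reversed(retry))

    return log


events = ["created", "updated", "closed"]

# Incident configuration: 5 in flight, retries on, no idempotence.
print(simulate_partition_log(events, 5, {"updated"}))
# ['created', 'closed', 'updated']

# Either mitigation preserves order in this model.
print(simulate_partition_log(events, 1, {"updated"}))
# ['created', 'updated', 'closed']
print(simulate_partition_log(events, 5, {"updated"}, idempotent=True))
# ['created', 'updated', 'closed']
```

In the first run, `closed` is appended while `updated` is still waiting for its retry, which matches the stale state support observed. As far as the Kafka producer documentation goes, the usual prevention is `enable.idempotence=true` (which keeps ordering with up to 5 in-flight requests) or, at a throughput cost, `max.in.flight.requests.per.connection=1`.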
