On-call · Hard · oc-g379

Subject: Ordering violation
Level: Senior–Staff
Time: ~40 min
Common in: Distributed systems · Algorithms & data structures interviews
Industries: Technology

Question

A wallet-balance service consumes a Kafka topic `wallet-events` keyed by `user_id`, applying `credit`/`debit`/`adjust` events that MUST be processed in per-user order. Last night, to add headroom, ops increased the topic's partition count from 12 to 24 (online, via `kafka-topics --alter`). This morning support reports a small number of users whose balance briefly went wrong before self-correcting, and one user is stuck in an inconsistent state. Dashboards: no lag, no consumer errors, no redeliveries — consumers look perfectly healthy. The producer uses the default partitioner (hash of key mod partition count). How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
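One hypothesis worth testing against real signals follows from the question's own description of the partitioner: if the partition is `hash(key) mod partition_count`, then changing the count from 12 to 24 can send a user's new events to a different partition from the one holding their older events, and Kafka does not move existing records when partitions are added. The sketch below is only an illustration of that arithmetic. It assumes CRC32 as a stand-in hash (Kafka's default partitioner hashes the key bytes with murmur2, so real partition numbers will differ), and the `user-N` keys and the helper names `partition_for` and `remapped_keys` are hypothetical.

```python
import zlib

OLD_PARTITIONS = 12  # partition count before the alter, from the question
NEW_PARTITIONS = 24  # partition count after the alter, from the question


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition as hash(key) mod partition count.

    Assumption: CRC32 is a stand-in hash for illustration only. It is NOT
    the murmur2 hash Kafka's default partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def remapped_keys(keys, old_n=OLD_PARTITIONS, new_n=NEW_PARTITIONS):
    """Return {key: (old_partition, new_partition)} for keys whose partition changed."""
    moved = {}
    for key in keys:
        old_p = partition_for(key, old_n)
        new_p = partition_for(key, new_n)
        if old_p != new_p:
            moved[key] = (old_p, new_p)
    return moved


# Hypothetical user ids, purely for demonstration.
user_ids = [f"user-{i}" for i in range(1000)]
moved = remapped_keys(user_ids)

print(f"{len(moved)} of {len(user_ids)} keys changed partition")
for key, (old_p, new_p) in list(moved.items())[:5]:
    print(f"{key}: partition {old_p} -> {new_p}")
```

Because 24 is a multiple of 12, a key's new partition is either its old partition or its old partition plus 12, so a moved user's events end up split across exactly two partitions. Those partitions may be read by different consumers with no ordering guarantee between them, which would be consistent with healthy-looking consumers and no lag. In a real triage the same comparison would be run with the producer's actual hash over the affected `user_id` values reported by support.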

Learn the concepts

Diagram & narrate the incident

