Question
A wallet-balance service consumes a Kafka topic `wallet-events` keyed by `user_id`, applying `credit`/`debit`/`adjust` events that MUST be processed in per-user order. Last night, to add headroom, ops increased the topic's partition count from 12 to 24 (online, via `kafka-topics --alter`). This morning support reports a small number of users whose balance briefly went wrong before self-correcting, and one user is stuck in an inconsistent state. Dashboards: no lag, no consumer errors, no redeliveries — consumers look perfectly healthy. The producer uses the default partitioner (hash of key mod partition count). How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
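
As a starting point for the hypothesis step, the partitioner rule stated in the question (hash of key mod partition count) can be checked directly. The sketch below is a minimal illustration under assumptions, not the producer's actual code: `zlib.crc32` stands in for the hash so the example has no dependencies (Kafka's Java client uses murmur2 on the key bytes), and the `user-<n>` key format and helper names are hypothetical.

```python
import zlib

OLD_PARTITIONS = 12
NEW_PARTITIONS = 24


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for the producer's partitioner: a stable hash of the key bytes,
    # reduced modulo the partition count. crc32 is an assumption made to keep
    # the sketch dependency-free; it will not match Kafka's real assignments.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def remapped_keys(keys, old_n=OLD_PARTITIONS, new_n=NEW_PARTITIONS):
    # Return {key: (old_partition, new_partition)} for keys whose partition
    # changes when the partition count goes from old_n to new_n.
    moved = {}
    for key in keys:
        old_p = partition_for(key, old_n)
        new_p = partition_for(key, new_n)
        if old_p != new_p:
            moved[key] = (old_p, new_p)
    return moved


# Hypothetical key format, used only for illustration.
user_ids = [f"user-{i}" for i in range(1000)]
moved = remapped_keys(user_ids)
print(f"{len(moved)} of {len(user_ids)} keys changed partition")
```

Because 24 is a multiple of 12, a key either stays on partition p or moves to p + 12 under any hash-mod scheme. For a moved key, events written before the change remain on the old partition while new events land on the new one, so the two streams can be consumed independently with no ordering guarantee between them, even when every consumer looks healthy.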