Question
A `device-state` service consumes a single Kafka partition (per-device events are correctly ordered on the broker) and applies `connect`/`config`/`disconnect` transitions, which must take effect in order per device. Support reports a few devices stuck 'connected' that should be 'disconnected.' Dashboards show no lag, no errors, no rebalances, and no redeliveries; the consumer reads records strictly in order. Recent context: last week the consumer was 'optimized' by handing each polled record to a thread pool (`executor.submit(processRecord)`) so processing could parallelize. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
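
One mitigation short of a full revert can be sketched as follows. The records are read in order, but a shared thread pool gives no guarantee that they finish in order, so a `disconnect` can be applied before the `connect` that preceded it. The sketch routes every record for a given device to the same single-worker lane, which keeps per-device order while still allowing parallelism across devices. It is written in Python with `concurrent.futures` purely for illustration; the `executor.submit(processRecord)` call in the question may well be a Java `ExecutorService`, where the same shape applies. `KeyedExecutor`, `TRANSITIONS`, `replay`, and the lane count are hypothetical names and values, not part of the service described above.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical transition table: the state a device ends up in after each event.
TRANSITIONS = {
    "connect": "connected",
    "config": "connected",
    "disconnect": "disconnected",
}


class KeyedExecutor:
    """Sends all work for one key to the same single-worker lane.

    A one-worker ThreadPoolExecutor drains its queue in submission order,
    so tasks that share a key run in the order they were submitted.
    """

    def __init__(self, lanes=4):
        self._lanes = [ThreadPoolExecutor(max_workers=1) for _ in range(lanes)]

    def submit(self, key, fn, *args):
        # crc32 gives a stable key-to-lane mapping for the life of the process.
        index = zlib.crc32(key.encode("utf-8")) % len(self._lanes)
        return self._lanes[index].submit(fn, *args)

    def shutdown(self):
        # Block until every lane has finished its queued work.
        for lane in self._lanes:
            lane.shutdown(wait=True)


def replay(records, lanes=4):
    """Apply (device_id, event) records in per-device order.

    Returns the final state per device and the order in which events
    were actually applied.
    """
    state = {}
    applied = []
    lock = threading.Lock()  # guards the shared dict and list

    def process_record(device_id, event):
        with lock:
            state[device_id] = TRANSITIONS[event]
            applied.append((device_id, event))

    pool = KeyedExecutor(lanes)
    futures = [pool.submit(d, process_record, d, e) for d, e in records]
    pool.shutdown()
    for future in futures:
        future.result()  # re-raise any exception hit inside a worker
    return state, applied


if __name__ == "__main__":
    polled = [
        ("dev-1", "connect"),
        ("dev-2", "connect"),
        ("dev-1", "config"),
        ("dev-1", "disconnect"),
        ("dev-2", "config"),
    ]
    final_state, _ = replay(polled)
    print(sorted(final_state.items()))
```

The lane count caps parallelism, and a very busy device serializes everything that shares its lane, so this trades some throughput for ordering. Setting the pool to one worker, or reverting the handoff entirely, is the simpler stop-the-bleeding step if that throughput is not needed.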