Question
Customer support escalates: ~40 minutes of order events appear to be missing from the downstream warehouse, but only for one product line. The producing service writes to a Kafka topic (`acks=1`, `min.insync.replicas` not set, RF=3). Dashboards show: a broker (the leader for 6 of 24 partitions) crashed and was replaced 45 minutes ago; `UnderReplicatedPartitions` spiked then recovered; consumer lag is zero; producer error rate stayed flat the whole time; the missing orders all hash to partitions whose leader was the crashed broker. No producer retries fired. How do you confirm the loss, mitigate now, and recover the lost orders?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.