On-call · Hard · oc-g112

Subject: Data loss
Level: Senior–Staff · ~40 min
Common in: Databases & SQL · Concurrency · Reliability & on-call · Distributed systems interviews
Industries: Technology

Question

Customer support escalates: ~40 minutes of order events appear to be missing from the downstream warehouse, but only for one product line. The producing service writes to a Kafka topic (`acks=1`, `min.insync.replicas` not set, RF=3). Dashboards show: a broker (the leader for 6 of 24 partitions) crashed and was replaced 45 minutes ago; `UnderReplicatedPartitions` spiked then recovered; consumer lag is zero; producer error rate stayed flat the whole time; the missing orders all hash to partitions whose leader was the crashed broker. No producer retries fired. How do you confirm the loss, mitigate now, and recover the lost orders?
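
One way to approach the "confirm the loss" part is a reconciliation between a system of record and the warehouse for the incident window. The sketch below is illustrative only: it assumes some source of truth for order IDs exists (for example the orders database), and `partition_of` is a hypothetical helper that must reproduce the producer's actual partitioning (the Java client's default partitioner for keyed records uses a murmur2 hash of the key bytes; other clients may differ). The toy modulo partitioner in the demo is a stand-in, not Kafka's real hash.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable, Iterable


def find_missing_orders(
    source_ids: Iterable[str],
    warehouse_ids: Iterable[str],
    partition_of: Callable[[str], int],
) -> dict[int, list[str]]:
    """Group order IDs present at the source but absent downstream by partition."""
    delivered = set(warehouse_ids)
    missing: dict[int, list[str]] = defaultdict(list)
    for order_id in source_ids:
        if order_id not in delivered:
            missing[partition_of(order_id)].append(order_id)
    return dict(missing)


def loss_matches_failover(missing: dict[int, list[str]], affected: set[int]) -> bool:
    """True when something is missing and every missing order maps to an affected partition."""
    return bool(missing) and set(missing) <= affected


if __name__ == "__main__":
    # Hypothetical data: IDs and the modulo partitioner are assumptions for the demo.
    toy_partitioner = lambda oid: int(oid.split("-")[1]) % 24
    missing = find_missing_orders(
        source_ids=["o-1", "o-2", "o-25", "o-7"],
        warehouse_ids=["o-2", "o-7"],
        partition_of=toy_partitioner,
    )
    print(missing)
    print(loss_matches_failover(missing, affected={1, 5, 9}))
```

The grouped result doubles as the replay list for recovery. If any missing order maps to a partition that the crashed broker did not lead, the failover hypothesis alone does not explain the gap.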

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
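
For the "what prevents a repeat" step, it can help to state explicitly why the settings in the question allow silent loss. The following is a minimal sketch, not a complete audit: `durability_risks` is a hypothetical helper, and it encodes only three facts about Kafka (with `acks=0` or `acks=1` the producer does not wait for followers, `min.insync.replicas` defaults to 1 when unset, and that setting is enforced only for `acks=all`).

```python
from __future__ import annotations


def durability_risks(
    acks: str,
    replication_factor: int,
    min_insync_replicas: int | None,
) -> list[str]:
    """Return human-readable durability or availability risks for one topic/producer pair."""
    # Kafka falls back to min.insync.replicas=1 when the setting is absent.
    min_isr = 1 if min_insync_replicas is None else min_insync_replicas
    risks: list[str] = []

    if acks in ("0", "1"):
        # The write is acknowledged before any follower has copied it, so a
        # leader crash can drop acknowledged records with no producer error.
        risks.append(f"acks={acks}: acknowledged writes may exist only on the leader")

    if min_isr < 2:
        # Even with acks=all, an in-sync set of just the leader would satisfy the check.
        risks.append(f"min.insync.replicas={min_isr}: a single replica can accept writes")

    if min_isr >= replication_factor:
        # Availability rather than durability: any broker outage blocks acks=all writes.
        risks.append("min.insync.replicas >= RF: no broker can fail without blocking writes")

    return risks


if __name__ == "__main__":
    # Settings from the question: acks=1, RF=3, min.insync.replicas not set.
    for risk in durability_risks("1", 3, None):
        print(risk)
    # A commonly used safer combination for RF=3.
    print(durability_risks("all", 3, 2))
```

Under these assumptions the incident configuration reports two risks, which is consistent with a flat producer error rate and no retries: every lost write had already been acknowledged by the leader.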
