Code Room
On-callHardoc-g158
Subject Message lossLevel Senior–Staff~40 minCommon in Distributed systems interviewsIndustries Technology, Software development

Question

Finance reports that ~0.2% of `ledger-events` are missing from the downstream warehouse — reconciliation finds gaps with no corresponding Kafka record. No consumer lag, no errors in consumer logs, exactly-once is NOT enabled. Dashboards: the `ledger-events` topic has `replication.factor=3` but `min.insync.replicas=1`; broker logs show two unclean leader elections in the last 24h during a rolling broker restart for an OS patch. The producer config is `acks=1`, `retries=0`. The gaps correlate in time with the broker restarts. Triage and explain the loss.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Diagram & narrate the incident
Loading whiteboard…
Run or narrate your approach, then ask the coach.