Question
Finance reconciliation finds ~0.05% of `payment-captured` events never reached the analytics warehouse — gaps with NO matching Kafka record at all (not a consumer gap). No consumer lag, no consumer errors. Dashboards: consumers process every record that exists. The producer side shows a small but nonzero `record-error-rate` and a `buffer-available-bytes` that periodically dips to ~0 during traffic peaks; the producer was recently tuned for throughput — `acks=1`, `linger.ms=20`, `max.in.flight.requests.per.connection=5`, `retries` left at a low value, and `delivery.timeout.ms` fairly short. Brokers had a brief leader election yesterday during the peak window where the gaps cluster. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.