Question
An append-only event log service writes newline-delimited JSON records to a local segment file, then a separate compactor ships sealed segments to object storage. At 02:40 the host's data disk briefly hit 100% (a runaway log filled it, then logrotate freed it 6 minutes later). No process crashed. This morning a downstream consumer reports ~0.05% of records fail to parse, and a few that *do* parse have a truncated tail field merged into the next record's first field. Dashboards: write success rate was 100% the whole time (the writer treats a short write as success), disk now at 60%. How do you triage, stop the bad segments from propagating, and recover the data?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.