Code Room
On-call · Medium
Question
PagerDuty fires: the host of a Node.js service hits 100% disk usage on / at 03:40, writes start failing with ENOSPC, and the service returns 500s. You SSH in and `df -h` confirms / is full, but `du -sh /var/log/*` and `du -sh /*` account for only ~40% of the disk, so the numbers don't add up. The service logs verbosely to a file that logrotate rotated at 03:00. A debug flag was flipped on yesterday to investigate a separate bug, raising log volume ~10x. How do you triage and recover?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals; here the key signal is that `df` reports a full disk while `du` accounts for only ~40% of it. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
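One hypothesis that fits a `df`/`du` gap right after a log rotation is a file that was deleted while a process still holds it open: `df` still counts its blocks, but `du` cannot see it because it no longer has a directory entry. The sketch below is one way to test that hypothesis and reclaim the space. It assumes Linux with `/proc` and GNU `stat`, and the function names `list_deleted_open` and `truncate_open_fd` are hypothetical helpers, not part of any standard tool.

```shell
# Sketch only: assumes Linux /proc and GNU coreutils (stat -c).
# Run as root to see file descriptors owned by other users.

# Print "PID FD SIZE_BYTES TARGET" for each regular file that is deleted
# but still open. Optional $1 restricts results to a path prefix.
list_deleted_open() {
  local prefix=${1:-/} link target pid fd size
  for link in /proc/[0-9]*/fd/*; do
    # The symlink target of a deleted file ends in " (deleted)".
    target=$(readlink "$link" 2>/dev/null) || continue
    case "$target" in
      "$prefix"*' (deleted)') ;;
      *) continue ;;
    esac
    # Skip sockets, pipes, and anything that is not a regular file.
    [ -f "$link" ] || continue
    size=$(stat -L -c %s "$link" 2>/dev/null) || continue
    pid=${link#/proc/}; pid=${pid%%/*}
    fd=${link##*/}
    printf '%s %s %s %s\n' "$pid" "$fd" "$size" "$target"
  done
}

# Truncate an open file through /proc to free its blocks without
# restarting the process that holds it. This discards the file's contents.
truncate_open_fd() {
  local pid=$1 fd=$2
  : > "/proc/$pid/fd/$fd"
}
```

A possible use during the incident is `list_deleted_open /var/log | sort -k3,3nr | head` to find the largest offender, then `truncate_open_fd <pid> <fd>` to free the space immediately. Truncating destroys the log contents, so copy out anything needed for the investigation first if there is space elsewhere. Restarting or signalling the service so it reopens its log file also releases the old file, at the cost of a brief interruption.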