On-callMediumoc-g179

Subject Disk fullLevel Mid–Senior~30 minCommon in Storage & CDN interviewsIndustries Software development

Question

PagerDuty fires: a Node.js service's host hits 100% disk on / at 03:40 and writes start failing with ENOSPC, causing 500s. You SSH in and `df -h` confirms / is full, but `du -sh /var/log/*` and `du -sh /*` only account for ~40% of the disk — the numbers don't add up. The service logs verbosely to a file that was rotated by logrotate at 03:00. There was a debug flag flipped on yesterday to investigate a separate bug, raising log volume ~10x. How do you triage and recover?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.