On-call · Medium · oc-g408

Subject: Disk full · Level: Mid–Senior · ~30 min · Common in Storage & CDN interviews · Industries: Technology, Software development

Question

An upload service's hosts start returning 500s at 16:00 with ENOSPC; `df -h` shows /tmp's volume at 100%. `du -sh /tmp/*` reveals millions of `upload-*.part` files totaling the whole volume, with timestamps spread over the last three weeks. The service streams multipart uploads to a `.part` spool file, then renames it into place on success and is supposed to `unlink` it on failure. A deploy three weeks ago added an early `return` in the request handler's error path that skips the cleanup `unlink`. Upload error rate has been a steady low single-digit percent the whole time. Triage and recover.
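To make the spool lifecycle concrete, here is a minimal Python sketch of the pattern the question describes: stream to a `.part` file, rename on success, unlink on failure. The function name `spool_upload` and its parameters are hypothetical; the real service's language and handler are not given in the question. The point of the sketch is that cleanup placed in a `finally` block still runs when someone later adds an early `return` or an exception escapes.

```python
import os
import tempfile


def spool_upload(chunks, final_path, spool_dir):
    """Stream chunks to an upload-*.part spool file, then rename into place.

    Hypothetical helper: the try/finally removes the spool file on every
    exit path that did not commit, including early returns and exceptions.
    """
    fd, part_path = tempfile.mkstemp(prefix="upload-", suffix=".part", dir=spool_dir)
    committed = False
    try:
        with os.fdopen(fd, "wb") as spool:
            for chunk in chunks:
                spool.write(chunk)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(part_path, final_path)
        committed = True
        return final_path
    finally:
        if not committed:
            try:
                os.unlink(part_path)
            except FileNotFoundError:
                pass  # already gone; nothing left to clean up
```

With this shape, a failed upload (for example, a client disconnect raised while iterating `chunks`) leaves no `.part` file behind, regardless of how the error path is later edited.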

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
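As one way to picture the mitigation step, the sketch below sweeps stale spool files to free space. It is an illustration under assumptions, not the service's tooling: `sweep_stale_parts` and its parameters are hypothetical, and the age threshold must be chosen to exceed the longest legitimate upload so in-flight files are left alone. It iterates with `os.scandir` instead of expanding a shell glob, since a glob over millions of files can exceed the argument-length limit, and it defaults to a dry run so the count can be checked before anything is deleted.

```python
import os
import time


def sweep_stale_parts(spool_dir, max_age_seconds, now=None, dry_run=True):
    """Find (and optionally remove) upload-*.part files older than a cutoff.

    Returns (file_count, apparent_bytes) for the files that matched.
    Hypothetical helper; max_age_seconds is an operator-chosen assumption.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    count = 0
    freed = 0
    with os.scandir(spool_dir) as entries:
        for entry in entries:
            name = entry.name
            if not (name.startswith("upload-") and name.endswith(".part")):
                continue
            try:
                if not entry.is_file(follow_symlinks=False):
                    continue
                st = entry.stat(follow_symlinks=False)
                if st.st_mtime >= cutoff:
                    continue  # recent enough to be an in-flight upload
                if not dry_run:
                    os.unlink(entry.path)
            except FileNotFoundError:
                continue  # removed concurrently (rename or cleanup)
            count += 1
            freed += st.st_size  # apparent size, not allocated blocks
    return count, freed
```

A possible sequence is a dry run first, for example `sweep_stale_parts("/tmp", 3600)`, then the same call with `dry_run=False` once the numbers look plausible. This only treats the symptom; the early `return` that skips the `unlink` still has to be fixed or rolled back, and a disk-usage alert or periodic sweep helps prevent a repeat.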
