Question
An upload service's hosts start returning 500s at 16:00 with ENOSPC; `df -h` shows /tmp's volume at 100%. `du -sh /tmp/*` reveals millions of `upload-*.part` files that together fill the volume, with timestamps spread over the last three weeks. The service streams each multipart upload to a `.part` spool file, renames it into place on success, and is supposed to `unlink` it on failure. A deploy three weeks ago added an early `return` in the request handler's error path that skips the cleanup `unlink`. The upload error rate has held steady at a low single-digit percentage the whole time. Triage and recover.
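The spool lifecycle described above can be sketched as follows. This is a minimal Python illustration, not the service's actual code: the function name `spool_upload`, its parameters, and the empty-chunk failure condition are all hypothetical. It shows how placing the `unlink` in a `finally` block keeps an early `return` on the error path from leaking the `.part` file.

```python
import os
import tempfile


def spool_upload(chunks, dest_path, spool_dir):
    """Stream chunks to a .part spool file, then rename it into place.

    Hypothetical handler. The try/finally removes the spool file on every
    path that does not commit, including an early return or an exception.
    """
    fd, part_path = tempfile.mkstemp(prefix="upload-", suffix=".part", dir=spool_dir)
    committed = False
    try:
        with os.fdopen(fd, "wb") as spool:
            for chunk in chunks:
                if not chunk:
                    # Hypothetical error path with an early return, like the
                    # one in the incident. The finally block still runs.
                    return False
                spool.write(chunk)
        # Atomic only when spool_dir and dest_path share a filesystem.
        os.replace(part_path, dest_path)
        committed = True
        return True
    finally:
        if not committed:
            try:
                os.unlink(part_path)
            except FileNotFoundError:
                pass  # already gone; nothing to clean up
```

With cleanup tied to the scope rather than to each individual error branch, a later edit that adds another early exit cannot skip it.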
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
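For the mitigation step, one possible way to reclaim space is to sweep stale spool files. The sketch below is a Python illustration under stated assumptions: `sweep_stale_parts`, its parameters, and the age threshold are hypothetical, and it assumes any `upload-*.part` file whose modification time is older than the threshold no longer belongs to an in-flight upload. It iterates the directory lazily with `os.scandir` instead of expanding a glob over millions of names, and it defaults to a dry run so the counts can be checked before anything is deleted.

```python
import os
import time


def sweep_stale_parts(spool_dir, max_age_seconds, now=None, dry_run=True):
    """Find (and optionally delete) upload-*.part files older than max_age_seconds.

    Returns (file_count, total_bytes) for the files that matched.
    With dry_run=True nothing is deleted.
    """
    now = time.time() if now is None else now
    count = 0
    total = 0
    # os.scandir yields entries lazily rather than building a full list.
    with os.scandir(spool_dir) as entries:
        for entry in entries:
            name = entry.name
            if not (name.startswith("upload-") and name.endswith(".part")):
                continue
            try:
                if not entry.is_file(follow_symlinks=False):
                    continue
                st = entry.stat(follow_symlinks=False)
                if now - st.st_mtime < max_age_seconds:
                    continue  # recent file: may be an in-flight upload
                if not dry_run:
                    os.unlink(entry.path)
            except FileNotFoundError:
                continue  # removed concurrently by something else
            count += 1
            total += st.st_size
    return count, total
```

A cautious sequence would be to call it once with `dry_run=True`, compare the reported count and bytes against what `df` and `du` showed, and only then rerun with `dry_run=False`. If a process still holds a deleted file open, the space is not released until that descriptor is closed.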