Question
An internal admin portal starts failing logins at 11:00 with 'No space left on device' whenever it writes a new file-backed session, but `df -h` shows the data volume at 47% used, with gigabytes free. The host has been up 90 days. `df -i` shows the volume at 100% inode usage. The session directory holds tens of millions of ~200-byte files. A cleanup cron exists but `find /var/sessions -mmin +120 -delete` has been running for hours without finishing, and load is climbing. There was no deploy; sign-up traffic has been slowly growing for months. How do you triage and recover safely?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
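For the mitigation step, one possible shape is a bounded, throttled purge instead of a single unbounded `find ... -delete`. The sketch below assumes the stuck cron job has already been stopped (two scans of the same directory would only add load), that session files older than 120 minutes are safe to remove (the cutoff the existing cron uses), and that a `find` and `xargs` with `-mmin`, `-print0`, and `-0` are available. The function name `purge_old_sessions`, the batch size, and the pause are hypothetical choices, not values from the incident.

```shell
# Hypothetical helper: delete files under DIR older than MAX_AGE_MIN minutes,
# in batches of at most BATCH files, sleeping PAUSE seconds between batches
# so the disk and CPU keep some headroom for serving logins.
#
# find makes one streaming pass; -xdev keeps it on the same volume, and
# -print0 / -0 stop unusual filenames from being split or misparsed.
# Inside `sh -c`, $0 holds the pause and "$@" holds the current batch of paths.
purge_old_sessions() {
  local dir="$1" max_age_min="$2" batch="${3:-5000}" pause="${4:-1}"

  find "$dir" -xdev -type f -mmin "+$max_age_min" -print0 |
    xargs -0 -n "$batch" sh -c 'rm -f -- "$@"; sleep "$0"' "$pause"
}
```

A call such as `purge_old_sessions /var/sessions 120 5000 1` would start freeing inodes as soon as the first batch is removed, so logins can recover before the full scan finishes. Re-running `df -i` while it works shows whether inode usage is actually dropping. Smaller batches or longer pauses trade cleanup speed for lower load.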