On-call · Medium · oc-g399

Subject: Inode exhaustion · Level: Mid–Senior · ~30 min · Common in: Reliability & on-call interviews · Industries: Technology, IT services

Question

An internal admin portal starts failing logins at 11:00 with 'No space left on device' whenever it writes a new file-backed session, but `df -h` shows the data volume at 47% used — gigabytes free. The host has been up 90 days. `df -i` shows the volume at 100% *inodes*. The session directory holds tens of millions of ~200-byte files. A cleanup cron exists but `find /var/sessions -mmin +120 -delete` has been running for hours without finishing, and load is climbing. There was no deploy; sign-up traffic has been slowly growing for months. How do you triage and recover safely?
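The gap between `df -h` and `df -i` in this scenario comes from the filesystem tracking two separate budgets: data blocks and inodes. As a minimal sketch of that distinction, the Python snippet below reads both counters through `os.statvfs`. The function name `fs_usage` and the returned keys are illustrative assumptions, not part of the question, and the percentages are computed more simply than `df` does, so they may differ slightly from its output.

```python
import os


def fs_usage(path):
    """Return rough block and inode usage for the filesystem holding `path`.

    Hypothetical helper for illustration; uses only os.statvfs (Unix).
    """
    st = os.statvfs(path)
    blocks_used = st.f_blocks - st.f_bfree
    inodes_used = st.f_files - st.f_ffree
    # Some filesystems report zero totals (e.g. dynamically allocated
    # inodes); return None instead of dividing by zero.
    block_pct = 100.0 * blocks_used / st.f_blocks if st.f_blocks else None
    inode_pct = 100.0 * inodes_used / st.f_files if st.f_files else None
    return {
        "block_pct": block_pct,
        "inode_pct": inode_pct,
        "inodes_free": st.f_ffree,
    }
```

Calling `fs_usage("/var/sessions")` on the affected host would be expected to show a moderate `block_pct` alongside an `inode_pct` near 100 and `inodes_free` near zero, which is the signature of many tiny files rather than a few large ones.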

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
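One way to narrate the mitigation step is a bounded, throttled cleanup that frees inodes steadily without adding to the climbing load. The sketch below is an assumption about how that might look, not a prescribed fix: the function name `purge_stale`, the batch size, and the pause are hypothetical, and the 7200-second age in the usage note simply mirrors the `-mmin +120` threshold from the question. It streams directory entries with `os.scandir` and sleeps between batches.

```python
import os
import time


def purge_stale(directory, max_age_s, batch=10_000, pause_s=0.5, now=None):
    """Unlink regular files older than `max_age_s` seconds, in batches.

    Hypothetical helper: streams entries instead of building a full list,
    and sleeps after every `batch` removals to limit I/O pressure.
    Returns the number of files removed.
    """
    cutoff = (now if now is not None else time.time()) - max_age_s
    removed = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                # Skip subdirectories, symlinks, and anything still fresh.
                if not entry.is_file(follow_symlinks=False):
                    continue
                if entry.stat(follow_symlinks=False).st_mtime >= cutoff:
                    continue
                os.unlink(entry.path)
            except FileNotFoundError:
                # The application or another cleaner removed it first.
                continue
            removed += 1
            if removed % batch == 0:
                time.sleep(pause_s)
    return removed
```

A call such as `purge_stale("/var/sessions", 7200)` would only address the symptom. The batch size and pause would need tuning against observed load, and the root cause (unbounded file-backed sessions on a volume with a fixed inode budget) still needs its own follow-up, along with an alert on inode usage.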

