Code Room
On-call | Hard | oc-g198
Subject: Inode exhaustion | Level: Senior–Staff | ~40 min | Common in: Reliability & on-call interviews | Industries: Technology, Software development

Question

An image-resizing service that caches every derived thumbnail as a separate file under /var/cache/thumbs starts failing with ENOSPC on writes at 09:30, yet `df -h` shows that volume only 60% full by bytes. Resized images are tiny (a few KB each), and the cache has accumulated tens of millions of them in a single flat directory since the cache-cleanup cron was accidentally disabled in a refactor three weeks ago. A campaign this morning drove a surge of new unique image-variant requests. Beyond the write failures, directory listings and lookups on that path have become very slow. Triage and remediate.
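
ENOSPC with free bytes remaining usually points at the filesystem's other finite resource, the inode table, which `df -h` does not report (`df -i` does). As a minimal sketch of that first check, the snippet below reads both counters through `os.statvfs`. The function name `fs_usage` is hypothetical, and the path in the usage note is taken from the question.

```python
import os


def fs_usage(path):
    """Return (bytes_used_fraction, inodes_used_fraction) for the filesystem holding path.

    inodes_used_fraction is None when the filesystem reports no fixed inode total.
    """
    st = os.statvfs(path)
    # f_blocks / f_bfree count blocks; f_files / f_ffree count inodes.
    bytes_used = 1 - st.f_bfree / st.f_blocks if st.f_blocks else 0.0
    # Some filesystems allocate inodes dynamically and report f_files == 0.
    inodes_used = 1 - st.f_ffree / st.f_files if st.f_files else None
    return bytes_used, inodes_used
```

Calling `fs_usage("/var/cache/thumbs")` on the affected host would be expected to show a moderate byte fraction next to an inode fraction at or near 1.0 if inode exhaustion is the cause. That is a hypothesis to confirm, not a given.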

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
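
One way to "stop the bleeding" here is to free inodes by deleting stale cache entries in bounded batches, without ever listing or sorting the whole directory. The sketch below assumes that thumbnails can be regenerated on demand and that file modification time is an acceptable staleness signal. The names `purge_old_files` and `sharded_path`, the batch size, and the two-level hash layout are all hypothetical illustrations, not the service's actual design.

```python
import hashlib
import os
import time


def purge_old_files(cache_dir, max_age_seconds, batch_limit=10_000, now=None):
    """Delete up to batch_limit regular files older than max_age_seconds.

    Returns the number of files removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    removed = 0
    # os.scandir yields entries lazily, so the directory is never loaded into memory at once.
    with os.scandir(cache_dir) as entries:
        for entry in entries:
            if removed >= batch_limit:
                break
            try:
                if not entry.is_file(follow_symlinks=False):
                    continue
                if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                    os.unlink(entry.path)
                    removed += 1
            except FileNotFoundError:
                # Another worker or cleanup job removed the file first.
                continue
    return removed


def sharded_path(cache_root, key):
    """Map a cache key to a two-level hashed subdirectory (hypothetical layout)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return os.path.join(cache_root, digest[:2], digest[2:4], digest)
```

Running the purge repeatedly with a modest `batch_limit` keeps each pass short and lets the responder watch inode usage recover between passes. The sharded layout addresses only the slow listings and lookups in a flat directory. It does not reduce inode consumption, so restoring the cleanup job and alerting on inode usage remain the actual prevention.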

Diagram & narrate the incident