Code Room
On-callHard
Question
A CI build host starts failing new builds with 'fork: retry: Resource temporarily unavailable' and SSH logins are refused, though CPU, memory, and disk all look fine. The host runs many short-lived containerized build steps. `ps -eLf | wc -l` shows ~30,000 tasks; `cat /proc/sys/kernel/pid_max` is 32768. A build job introduced last week spawns a per-test helper process but, on a code path hit by a flaky test, never reaps the children — `ps` shows thousands of <defunct> (zombie) processes parented to that job. Triage and fix.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.