On-call · Hard · oc-g664

Subject: Inference GPU memory fragmentation
Level: Senior–Staff
Time: ~40 min
Common in ML systems interviews
Industries: Technology

Question

Your LLM inference pods serve variable-length prompts on A100s and stay up for days. A pattern emerges: pods run fine for ~30-40 hours, then start throwing intermittent CUDA out-of-memory errors and rejecting some requests, even though average GPU memory usage and RPS are unchanged from when they were healthy. The dashboards show reported GPU memory 'in use' sitting around 70% while allocation failures climb over time. The OOMs correlate with pod age (older pods fail more), and restarting a pod fully resolves the problem for another ~35 hours. The request length distribution is highly variable, with short and very long prompts interleaved. There has been no deploy and no traffic change. How do you triage and address this?
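To see how 70% 'in use' can coexist with allocation failures, here is a minimal toy sketch in Python. It is not a model of any real CUDA allocator: the `Arena` class, the first-fit policy, and the unit sizes are all assumptions made for illustration. It only shows that free memory split into small holes cannot satisfy a large contiguous request.

```python
from typing import List, Optional, Tuple


class Arena:
    """Toy first-fit allocator over a fixed address range (hypothetical)."""

    def __init__(self, size: int) -> None:
        self.size = size
        # Live blocks as (offset, length), kept sorted by offset.
        self.blocks: List[Tuple[int, int]] = []

    def holes(self) -> List[Tuple[int, int]]:
        """Return the free gaps between live blocks as (offset, length)."""
        holes, cursor = [], 0
        for off, length in self.blocks:
            if off > cursor:
                holes.append((cursor, off - cursor))
            cursor = off + length
        if cursor < self.size:
            holes.append((cursor, self.size - cursor))
        return holes

    def alloc(self, length: int) -> Optional[int]:
        """Place the block in the first hole that fits; None means 'OOM'."""
        for off, hole_len in self.holes():
            if hole_len >= length:
                self.blocks.append((off, length))
                self.blocks.sort()
                return off
        return None

    def free(self, offset: int) -> None:
        self.blocks = [b for b in self.blocks if b[0] != offset]

    def used_fraction(self) -> float:
        return sum(length for _, length in self.blocks) / self.size

    def largest_hole(self) -> int:
        return max((length for _, length in self.holes()), default=0)


arena = Arena(100)
offsets = [arena.alloc(10) for _ in range(10)]  # fill the arena completely
for i in (1, 4, 7):                             # free three non-adjacent blocks
    arena.free(offsets[i])

print(arena.used_fraction())  # 0.7: 30 units are free in total
print(arena.largest_hole())   # 10: but no single hole exceeds 10 units
print(arena.alloc(20))        # None: a 20-unit request fails anyway
```

In this toy, the aggregate 'in use' number says nothing about whether a large request can be placed; the largest contiguous free block is the figure that matters.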

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
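One possible "stop the bleeding" step, given that a restart buys roughly 35 healthy hours, is to recycle pods proactively before they reach the failure window. The sketch below is a hedged illustration only: `pods_to_recycle`, the 24-hour threshold, the one-at-a-time limit, and the pod names are all hypothetical, and it treats the symptom rather than the root cause.

```python
from typing import Dict, List


def pods_to_recycle(
    pod_age_hours: Dict[str, float],
    max_age_hours: float = 24.0,   # assumed threshold, below the ~30-40 h failure window
    max_concurrent: int = 1,       # assumed limit so capacity is not drained at once
) -> List[str]:
    """Pick the oldest pods past max_age_hours, at most max_concurrent per pass."""
    overdue = [name for name, age in pod_age_hours.items() if age >= max_age_hours]
    overdue.sort(key=lambda name: pod_age_hours[name], reverse=True)
    return overdue[:max_concurrent]


# Hypothetical fleet snapshot: pod name -> hours since start.
ages = {"infer-a": 31.5, "infer-b": 6.0, "infer-c": 26.0}
print(pods_to_recycle(ages))  # ['infer-a']
```

Capping the number of pods recycled per pass keeps serving capacity stable while the fleet is staggered, which buys time to investigate the allocator behaviour itself.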

Diagram & narrate the incident