Question
Your LLM inference pods serve variable-length prompts on A100s and stay up for days. A pattern emerges: pods run fine for ~30-40 hours, then start throwing intermittent CUDA out-of-memory errors and rejecting some requests, even though average GPU memory usage and RPS are unchanged from when they were healthy. The dashboards show reported GPU memory 'in use' sitting around 70% while allocation failures climb over time. The OOMs correlate with pod age (older pods fail more), and restarting a pod fully resolves the problem for another ~35 hours. The request length distribution is highly variable (short and very long prompts interleaved). There has been no deploy and no traffic change. How do you triage and address this?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
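One hypothesis consistent with the signals above (OOMs at ~70% reported usage, failures that grow with pod age, a restart that clears them) is allocator fragmentation: enough memory is free in total, but no single free block is large enough for a long prompt. The sketch below is a toy model for reasoning about that signal, not a real allocator API. The function names, the block sizes, and the 80 GiB device size are all illustrative assumptions, and the hypothesis itself still has to be confirmed against real allocator statistics.

```python
from typing import List


def fragmentation_ratio(free_blocks_mib: List[int]) -> float:
    """Return 1 - largest_free / total_free.

    0.0 means all free memory is one contiguous region; values near 1.0
    mean free memory is scattered across many small holes.
    """
    total = sum(free_blocks_mib)
    if total == 0:
        return 0.0
    return 1.0 - max(free_blocks_mib) / total


def can_allocate(free_blocks_mib: List[int], request_mib: int) -> bool:
    """A request succeeds only if some single free block is large enough."""
    return any(block >= request_mib for block in free_blocks_mib)


# Hypothetical numbers: 24 GiB free on an assumed 80 GiB device (70% in use).
fresh_pod = [24576]      # one contiguous free region
aged_pod = [512] * 48    # same total free memory, split into 48 small holes

for name, blocks in (("fresh", fresh_pod), ("aged", aged_pod)):
    print(
        name,
        "free_mib=%d" % sum(blocks),
        "fragmentation=%.2f" % fragmentation_ratio(blocks),
        "fits_2GiB_request=%s" % can_allocate(blocks, 2048),
    )
```

Both pods report the same free total, so an 'in use' gauge cannot tell them apart; only a largest-free-block or fragmentation signal separates the healthy pod from the failing one. That gap between the dashboard and the failure mode is the kind of evidence that would confirm or rule out this hypothesis.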