On-call · Hard · oc-g564

Subject: Inference GPU OOM · Level: Senior–Staff · ~35 min · Common in ML systems interviews · Industries: Technology

Question

Your image-captioning inference service runs on A10 GPUs behind a queue. At 09:20 the pod restart rate spikes: replicas hit CUDA out-of-memory errors and are killed at the GPU memory limit, then the queue backs up while they reload the model. Dashboards show that requests per second are normal, but the p95 of the input token/image-size distribution shifted right an hour ago, GPU memory now sawtooths up to the limit before each crash, and the successful-request rate is dropping. There has been no code deploy in two days; a partner started sending high-resolution images this morning. How do you triage and stabilize?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
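One way to make the "mitigate first" step concrete is an input guard in front of the model. The sketch below is illustrative only: it assumes that per-request GPU memory grows roughly with pixel count, which would need to be confirmed by profiling the actual model. The names `MAX_PIXELS`, `capped_size`, and `batches_under_budget` are hypothetical and do not come from any particular serving framework.

```python
import math

# Hypothetical cap on pixels per image; the right value comes from profiling.
MAX_PIXELS = 1024 * 1024


def capped_size(width: int, height: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return (width, height) scaled down to fit max_pixels, keeping aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    pixels = width * height
    if pixels <= max_pixels:
        return width, height  # small images pass through unchanged
    scale = math.sqrt(max_pixels / pixels)
    # Floor each side so the result never exceeds the cap; keep at least 1 px.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


def batches_under_budget(pixel_counts: list[int], budget: int) -> list[list[int]]:
    """Greedily group requests so each batch's total pixels stays within budget.

    A single request larger than the budget still gets its own batch, so it
    should be resized (or rejected) before batching.
    """
    batches: list[list[int]] = []
    current: list[int] = []
    used = 0
    for pixels in pixel_counts:
        if current and used + pixels > budget:
            batches.append(current)
            current, used = [], 0
        current.append(pixels)
        used += pixels
    if current:
        batches.append(current)
    return batches
```

Capping input size bounds the worst-case request, and batching by a pixel budget rather than a fixed request count keeps a few large images from pushing one batch over the memory limit. Neither replaces the root-cause work; they only buy time while the traffic shift is investigated and the partner is contacted.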

