On-call · Hard · oc-g568

Subject: ML training job OOM · Level: Senior–Staff · ~35 min · Common in ML systems interviews · Industries: Technology

Question

It's the night before a quarterly model-refresh deadline. The distributed training job for the next ranking model has crashed three times in the last hour: each run gets ~60% through an epoch, then a worker hits a CUDA OOM and the whole job dies, losing all progress in that epoch. The dashboards show that per-GPU memory climbs across the run rather than staying flat, the dataset grew ~30% this quarter, and a recent commit added two new high-cardinality embedding features. There's no checkpoint newer than the start of the failing epoch. How do you get a good model trained before the deadline, and what do you fix?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
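One way to turn "form hypotheses from real signals" into something concrete is to check whether per-GPU memory is actually trending upward and, if so, roughly when it would hit the device limit. The sketch below is illustrative only: the function names, the sample readings, and the 80 GiB capacity are assumptions rather than anything given in the question, and a straight-line fit is a simplification of real allocator behavior.

```python
from statistics import mean


def memory_slope(steps, mem_gib):
    """Least-squares slope of memory usage, in GiB per training step."""
    if len(steps) != len(mem_gib) or len(steps) < 2:
        raise ValueError("need at least two paired samples")
    mean_step, mean_mem = mean(steps), mean(mem_gib)
    num = sum((s - mean_step) * (m - mean_mem) for s, m in zip(steps, mem_gib))
    den = sum((s - mean_step) ** 2 for s in steps)
    if den == 0:
        raise ValueError("all samples were taken at the same step")
    return num / den


def steps_until_oom(steps, mem_gib, capacity_gib):
    """Project how many more steps fit before memory reaches capacity.

    Returns None when memory is flat or falling, which would point away
    from a leak and toward a larger steady-state footprint.
    """
    slope = memory_slope(steps, mem_gib)
    if slope <= 0:
        return None
    headroom = capacity_gib - mem_gib[-1]
    return max(0, int(headroom / slope))


# Hypothetical dashboard readings (step, GiB); not taken from the incident.
sample_steps = [0, 1000, 2000, 3000]
sample_mem = [40.0, 45.0, 50.0, 55.0]
projected = steps_until_oom(sample_steps, sample_mem, capacity_gib=80.0)
```

A steady positive slope is consistent with something accumulating over the run (for example, tensors retained across steps), whereas a flat but higher line would fit a larger footprint from the new embeddings or the bigger dataset. Either reading is a hypothesis to test, not a conclusion.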

