Question
It's the night before a quarterly model-refresh deadline. The distributed training job for the next ranking model has crashed three times in the last hour: each run gets ~60% through an epoch, then a worker hits CUDA OOM and the whole job dies, losing progress. Dashboards: per-GPU memory climbs across the run rather than staying flat, the dataset grew ~30% this quarter, and a recent commit added two new high-cardinality embedding features. There's no checkpoint newer than the start of the failing epoch. How do you get a good model trained before the deadline, and what do you fix?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.