System design · Hard · sd-g368

Subject: Training pipeline · Level: Senior–Staff · ~50 min · Common in ML systems · Distributed systems interviews · Industries: Technology, Software development

Question

Design the fault-tolerance and checkpointing layer for a distributed training job that runs for two weeks across 512 GPUs. At that scale a hardware failure (a GPU falling off, a node crashing, a network blip) is not an edge case but a near-certainty, likely several times per run. A naive 'restart from scratch on any failure' policy wastes days. Walk through how you checkpoint efficiently, how you recover from a node failure without losing meaningful progress, and how you keep checkpointing itself from becoming a bottleneck that stalls the GPUs.

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
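One way to make "clarify scale and constraints" concrete is a back-of-envelope estimate of how often to checkpoint. The sketch below uses Young's first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF), and assumes independent failures so that per-device failure rates add. The function names are hypothetical, and the per-GPU MTBF and checkpoint cost are illustrative assumptions rather than measured values; only the 512 GPUs and two-week duration come from the question.

```python
import math


def system_mtbf_hours(per_device_mtbf_hours: float, num_devices: int) -> float:
    """MTBF of the whole job, assuming independent failures (rates add)."""
    return per_device_mtbf_hours / num_devices


def young_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the best checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def overhead_fraction(interval_hours: float, checkpoint_cost_hours: float,
                      mtbf_hours: float) -> float:
    """Approximate wasted fraction of wall-clock time for a given interval.

    First term: time stalled writing checkpoints.
    Second term: expected rework, about half an interval lost per failure.
    """
    return (checkpoint_cost_hours / interval_hours
            + interval_hours / (2.0 * mtbf_hours))


# Illustrative assumptions (NOT from the question): 50,000 h MTBF per GPU,
# 2 minutes of GPU stall per synchronous checkpoint.
PER_GPU_MTBF_H = 50_000.0
CHECKPOINT_COST_H = 2.0 / 60.0
NUM_GPUS = 512            # from the question
RUN_HOURS = 14 * 24       # two weeks, from the question

mtbf = system_mtbf_hours(PER_GPU_MTBF_H, NUM_GPUS)
interval = young_interval_hours(CHECKPOINT_COST_H, mtbf)

print(f"job-level MTBF:      {mtbf:.1f} h")
print(f"expected failures:   {RUN_HOURS / mtbf:.1f} per run")
print(f"checkpoint interval: {interval:.2f} h")
print(f"approx. overhead:    {overhead_fraction(interval, CHECKPOINT_COST_H, mtbf):.1%}")
```

The same arithmetic shows why the stall matters: the suggested interval shrinks with the square root of the checkpoint cost, so cutting the stall (for example by writing asynchronously) allows more frequent checkpoints and less lost work per failure.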
