Code Room
System design · Hard
Question
Design the fault-tolerance and checkpointing layer for a distributed training job that runs for two weeks across 512 GPUs. At that scale, a hardware failure (a GPU falling off, a node crashing, a network blip) is not an edge case but a near-certainty, and it will likely happen multiple times per run. A naive 'restart from scratch on any failure' policy wastes days. Walk through how you checkpoint efficiently, how you recover from a node failure without losing meaningful progress, and how you keep checkpointing itself from becoming a bottleneck that stalls the GPUs.
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
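One way to make "clarify scale and constraints first" concrete is a back-of-envelope estimate of how often failures occur and how often to checkpoint. The sketch below is illustrative only: the per-GPU MTBF and the per-checkpoint stall time are invented assumptions, not measured figures, and the function names are hypothetical. It uses Young's first-order approximation for the checkpoint interval, sqrt(2 * C * M), where C is the checkpoint cost and M is the mean time between failures for the whole job, and it assumes independent failures at a constant rate.

```python
import math


def expected_failures(num_gpus: int, run_hours: float, gpu_mtbf_hours: float) -> float:
    """Expected failure count over the run, assuming independent, constant-rate failures."""
    return num_gpus * run_hours / gpu_mtbf_hours


def young_interval_s(checkpoint_cost_s: float, job_mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * job_mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, job_mtbf_s: float) -> float:
    """Approximate fraction of time lost to checkpoint stalls plus redone work.

    First term: stall per checkpoint, amortized over the interval.
    Second term: on average half an interval of work is lost per failure.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * job_mtbf_s)


NUM_GPUS = 512                 # from the question
RUN_HOURS = 14 * 24            # two weeks, from the question
GPU_MTBF_HOURS = 50_000.0      # ASSUMPTION: illustrative per-GPU MTBF, not a measured value
CHECKPOINT_COST_S = 60.0       # ASSUMPTION: seconds the GPUs stall per checkpoint

failures = expected_failures(NUM_GPUS, RUN_HOURS, GPU_MTBF_HOURS)
job_mtbf_s = GPU_MTBF_HOURS / NUM_GPUS * 3600.0   # any single GPU failure interrupts the job
interval_s = young_interval_s(CHECKPOINT_COST_S, job_mtbf_s)
overhead = overhead_fraction(interval_s, CHECKPOINT_COST_S, job_mtbf_s)

print(f"expected failures per run: {failures:.2f}")
print(f"checkpoint interval: {interval_s / 60:.0f} min")
print(f"estimated overhead: {overhead:.1%}")
```

Under these assumed inputs the model shows the trade-off the question is probing: checkpointing more often than the computed interval spends more time stalled on checkpoints, while checkpointing less often loses more work per failure. Reducing the stall time C (for example with asynchronous or sharded checkpoint writes) shortens the optimal interval and lowers total overhead.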