System design · Hard · sd-g093

Subject: Training pipeline
Level: Senior–Staff
~50 min
Common in: ML systems · Concurrency · Distributed systems interviews
Industries: Technology

Question

Design the training infrastructure for a large language model trained across 1024 GPUs for several weeks on a multi-petabyte tokenized corpus. Hardware failures are routine at this scale: expect a GPU or node to die every few hours. Design the system so that a multi-week run is survivable. Cover the data pipeline that feeds 1024 workers without becoming the bottleneck, the parallelism strategy, and how checkpointing and recovery keep a single node failure from wasting the whole run.

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts (data model, bottlenecks, consistency, failure modes) and name the trade-offs you are making.
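One way to make "clarify scale and constraints" concrete for the checkpointing part of the question is a back-of-envelope estimate of how often to checkpoint. The sketch below uses Young's first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF), which is a standard rule of thumb and not something the question prescribes. The two inputs are assumptions picked for illustration: a failure every 3 hours (one reading of "every few hours") and a 60-second training stall per checkpoint. The function names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Returns sqrt(2 * C * MTBF), the interval that minimizes the
    first-order overhead modeled in expected_overhead() below.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_overhead(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of wall-clock time lost.

    Term 1: time spent stalled writing checkpoints (C / T).
    Term 2: work redone after a failure, on average half an
            interval per failure (T / (2 * MTBF)).
    Restart time itself is ignored in this simplified model.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


# Assumed inputs, not given in the question.
MTBF_S = 3 * 3600          # cluster-wide mean time between failures: 3 hours
CHECKPOINT_COST_S = 60.0   # training stall per checkpoint: 60 seconds

interval = young_interval(CHECKPOINT_COST_S, MTBF_S)
overhead = expected_overhead(interval, CHECKPOINT_COST_S, MTBF_S)

print(f"checkpoint every ~{interval / 60:.0f} min, "
      f"first-order overhead ~{overhead:.1%}")
```

Under these assumed numbers the interval comes out at roughly 19 minutes with about 10% of wall-clock time lost. The model also shows which lever matters: halving the checkpoint stall (for example by writing asynchronously) shrinks both the interval and the overhead, which is a trade-off worth naming in the answer.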
