Code Room
System design · Hard
Question
Design the GPU-sharing and cold-start layer for a platform that hosts thousands of customers' fine-tuned models, where most models are idle most of the time but must respond within ~2s when called, and you can't afford a dedicated GPU per model. Walk through how you pack many models onto shared GPUs, how you handle a request for a model that's currently unloaded (cold start), and how you isolate tenants so one customer's traffic or a bad model can't starve or crash others.
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
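As one possible starting point for the packing and cold-start parts of the question, the sketch below tracks which models are resident on a single GPU and evicts the least recently used ones when a cold model needs room. It is a minimal illustration, not a reference answer: the class name `GpuModelCache`, the `acquire` method, and the megabyte sizes are all hypothetical, and a real design would also need actual weight loading, per-tenant quotas, and protection against evicting a model that is mid-request.

```python
from collections import OrderedDict
from typing import List, Tuple


class GpuModelCache:
    """Hypothetical bookkeeping for models resident on one GPU (LRU eviction)."""

    def __init__(self, capacity_mb: int) -> None:
        self.capacity_mb = capacity_mb
        self.used_mb = 0
        # Maps model_id -> size_mb; insertion order doubles as recency order,
        # with the least recently used model at the front.
        self._resident = OrderedDict()

    def acquire(self, model_id: str, size_mb: int) -> Tuple[bool, List[str]]:
        """Mark a model as needed; return (was_cold_start, evicted_model_ids)."""
        if size_mb > self.capacity_mb:
            raise ValueError("model does not fit on this GPU at all")

        if model_id in self._resident:
            # Warm hit: refresh recency, nothing to load or evict.
            self._resident.move_to_end(model_id)
            return False, []

        # Cold start: evict least recently used models until the new one fits.
        evicted = []
        while self.used_mb + size_mb > self.capacity_mb:
            victim_id, victim_mb = self._resident.popitem(last=False)
            self.used_mb -= victim_mb
            evicted.append(victim_id)

        # A real system would load the weights here; this only does accounting.
        self._resident[model_id] = size_mb
        self.used_mb += size_mb
        return True, evicted
```

A strong answer would go on to explain why plain LRU is or is not the right policy here, for example by weighing reload cost and per-tenant fairness against simple recency.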