System design | Hard | sd-g365

Subject: Model serving | Level: Senior–Staff | ~45 min | Common in ML systems interviews | Industries: Technology, Software development

Question

Design the GPU-sharing and cold-start layer for a platform that hosts thousands of customers' fine-tuned models, where most models are idle most of the time but must respond within ~2s when called, and you can't afford a dedicated GPU per model. Walk through how you pack many models onto shared GPUs, how you handle a request for a model that's currently unloaded (cold start), and how you isolate tenants so one customer's traffic or a bad model can't starve or crash others.

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
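As one illustration of going deep on a hard part, the sketch below models a single GPU as a memory-bounded cache of loaded models with least-recently-used eviction, so a request is either warm (model already resident) or cold (model must be loaded, possibly evicting others). Everything here is an assumption made for illustration: the `GpuModelCache` class, the model IDs, the sizes in GB, and the choice of LRU as the eviction policy are hypothetical, not part of the question.

```python
from collections import OrderedDict


class GpuModelCache:
    """Toy LRU cache of models resident on one GPU (hypothetical; sizes in GB)."""

    def __init__(self, capacity_gb: float):
        self.capacity_gb = capacity_gb
        # model_id -> size_gb, ordered from least to most recently used.
        self.resident = OrderedDict()

    def used_gb(self) -> float:
        return sum(self.resident.values())

    def request(self, model_id: str, size_gb: float):
        """Return ("warm" or "cold", list of evicted model IDs) for one request."""
        if model_id in self.resident:
            self.resident.move_to_end(model_id)  # mark as most recently used
            return "warm", []
        if size_gb > self.capacity_gb:
            raise ValueError(f"{model_id} can never fit on this GPU")
        evicted = []
        # Cold start: evict least recently used models until the new one fits.
        while self.used_gb() + size_gb > self.capacity_gb:
            victim, _ = self.resident.popitem(last=False)
            evicted.append(victim)
        self.resident[model_id] = size_gb
        return "cold", evicted


cache = GpuModelCache(capacity_gb=10)
print(cache.request("tenant-a/model-1", 4))  # cold, nothing evicted
print(cache.request("tenant-b/model-2", 4))  # cold, nothing evicted
print(cache.request("tenant-a/model-1", 4))  # warm
print(cache.request("tenant-c/model-3", 4))  # cold, evicts tenant-b/model-2
```

The sketch deliberately leaves out the parts an interviewer is likely to probe: how long the cold load takes against the ~2s budget, which GPU a request is routed to, and per-tenant limits so that one customer's traffic cannot evict everyone else's models.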
