Code Room
System design · Hard
Question
Design an online model-serving platform that hosts thousands of heterogeneous models (from small CPU classifiers to large GPU transformers) for many internal teams, with per-model latency SLOs ranging from 10 ms to 500 ms and bursty traffic. The platform must support safe rollouts, multi-version traffic splitting, and autoscaling to zero for cold models, and it must not let one noisy tenant starve another. Cover the serving architecture, how you schedule across CPU and GPU, and the rollout and observability story.
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
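As one illustration of going deep on a hard part, the multi-version traffic splitting named in the question could be sketched as a deterministic weighted hash. This is a minimal sketch under stated assumptions, not a reference answer: the function name `pick_version`, the integer weight map, and the use of a per-request key (for example a caller or session ID) are all hypothetical choices made for the example.

```python
import hashlib


def pick_version(request_key: str, weights: dict[str, int]) -> str:
    """Map a request key to a model version in proportion to integer weights.

    Hypothetical helper: `weights` maps version name -> non-negative weight,
    e.g. {"v1": 90, "v2": 10} for a 10% canary.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    # A stable hash keeps the same key on the same version across requests
    # and across router replicas, with no shared state.
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total

    # Walk the versions in a fixed order and return the one whose
    # cumulative weight range contains the bucket.
    cumulative = 0
    for version, weight in sorted(weights.items()):
        cumulative += weight
        if bucket < cumulative:
            return version
    raise AssertionError("unreachable: bucket is always below total")


if __name__ == "__main__":
    split = {"v1": 90, "v2": 10}
    print(pick_version("tenant-a/request-123", split))
```

The trade-off worth naming: hashing on a stable key gives sticky assignment, which makes canary metrics easier to compare, but changing the weights reassigns some keys. A rollback is then just a weight change to zero for the new version.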