System designHardsd-g354

Subject Model servingLevel Senior–Staff~45 minCommon in ML systems interviewsIndustries Technology, Software development

Question

Design a multi-model serving platform where a single ranking team runs ~30 model versions simultaneously: a stable champion, several challengers under canary, and per-segment variants. Each version has different latency/memory profiles; aggregate traffic is 40k QPS with a 30ms p99 budget. Requirements: route a configurable % of traffic to each version, automatically halt and roll back a canary if its error rate or latency regresses, and autoscale each version independently so a low-traffic challenger doesn't pin idle GPUs. Walk through the routing layer, the canary-promotion control loop, and how you autoscale per-version without thrashing.

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.

Learn the concepts

Narrate your design

Loading whiteboard…

Run or narrate your approach, then ask the coach.