Question
Design a model CI/CD and deployment system that lets ~150 teams promote models from training to production safely. A new model version must first be validated offline (eval metrics, fairness/slice checks), then run in shadow against live traffic without affecting users, then roll out via canary with automatic rollback on metric regression, all without redeploying the serving binary. Serving handles 100k req/sec; deploys happen dozens of times a day across teams. You must support fast rollback and keep a clear audit trail of which model version produced which prediction.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.