Question
Design an ML feature-monitoring and observability system that watches features and predictions across ~200 production models. For each model you must detect, within minutes-to-hours, when an input feature's distribution shifts, when a feature pipeline silently starts emitting nulls or stale values, and when prediction distributions drift relative to training. The system ingests roughly 500k predictions/sec in aggregate, must support both real-time alerting and historical investigation, and most models have delayed or no ground-truth labels, so you can't rely on accuracy as the primary signal.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.