Question
Your ad-ranking service runs an online/offline feature-skew monitor that compares the live feature vector logged at serving time against the same feature recomputed by the offline training pipeline. At 09:00 the monitor fires: for the feature 'ctr_30d_smoothed', the online and offline values now diverge for ~40% of impressions (online values are systematically lower), where yesterday they matched within tolerance. The serving endpoint is returning HTTP 200s, latency is normal, and revenue-per-impression has started drifting down. Dashboards show that the skew started right after an overnight offline pipeline change that switched the smoothing prior from a fixed constant to a category-level Bayesian prior; the online serving service still uses the old fixed-constant smoothing. There has been no serving deploy. How do you triage and respond?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
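
To make the mismatch in the scenario concrete, here is a minimal sketch of how a fixed-constant prior and a category-level prior can produce different values for the same raw counts. It assumes simple additive (pseudo-count) smoothing, which the question does not confirm. The function names, the prior CTR of 0.01, the prior strength of 100, the 5% relative tolerance, and the toy rows are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    clicks: float
    impressions: float


# Hypothetical constants; the real pipeline's values are not given in the question.
FIXED_PRIOR_CTR = 0.01
PRIOR_STRENGTH = 100.0


def smooth_fixed(c: Counts) -> float:
    """Online path (assumed): additive smoothing toward one global constant."""
    return (c.clicks + FIXED_PRIOR_CTR * PRIOR_STRENGTH) / (c.impressions + PRIOR_STRENGTH)


def smooth_category(c: Counts, category_ctr: float) -> float:
    """Offline path (assumed): the same formula, but the prior mean is per category."""
    return (c.clicks + category_ctr * PRIOR_STRENGTH) / (c.impressions + PRIOR_STRENGTH)


def skew_rate(rows, rel_tol: float = 0.05) -> float:
    """Fraction of rows whose online and offline values differ by more than rel_tol.

    Each row is a (Counts, category_ctr) pair.
    """
    diverging = 0
    for counts, category_ctr in rows:
        online = smooth_fixed(counts)
        offline = smooth_category(counts, category_ctr)
        rel_diff = abs(online - offline) / max(abs(offline), 1e-12)
        if rel_diff > rel_tol:
            diverging += 1
    return diverging / len(rows)


# Toy data: three impressions in a category whose prior equals the old constant,
# two in a category with a higher prior CTR.
rows = [
    (Counts(5, 200), 0.01),
    (Counts(5, 200), 0.01),
    (Counts(5, 200), 0.01),
    (Counts(5, 200), 0.04),
    (Counts(5, 200), 0.04),
]

print(smooth_fixed(Counts(5, 200)))
print(smooth_category(Counts(5, 200), 0.04))
print(skew_rate(rows))
```

In this toy data, only the rows whose category prior is higher than the old constant diverge, and for those rows the online value is the lower one. The 2-of-5 split was picked to mirror the ~40% figure in the question; it is not derived from any real traffic. Under these assumptions, the sketch suggests one check worth running during triage: whether the diverging impressions cluster in categories whose new prior differs most from the old constant.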