On-callHardoc-g659

Subject Feature pipeline online offline skewLevel Senior–Staff~35 minCommon in ML systems · Databases & SQL · Reliability & on-call interviewsIndustries Technology

Question

Your ad-ranking service runs an online/offline feature-skew monitor that compares the live feature vector logged at serving time against the same feature recomputed by the offline training pipeline. At 09:00 the monitor fires: for the feature 'ctr_30d_smoothed', the online and offline values now diverge for ~40% of impressions (online values are systematically lower), where yesterday they matched within tolerance. The model serves 200s, latency is normal, and revenue-per-impression has started drifting down. Dashboards: the skew started right after an overnight offline pipeline change that switched the smoothing prior from a fixed constant to a category-level Bayesian prior; the online serving service still uses the old fixed-constant smoothing. No serving deploy. How do you triage and respond?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.