On-callHardoc-g663

Subject Feature pipeline batch scoring nullsLevel Senior–Staff~35 minCommon in ML systems interviewsIndustries Technology

Question

Your nightly batch-scoring job writes a 'propensity_to_churn' score for every active account into a table the CRM reads to trigger retention campaigns. This morning the marketing ops team reports retention emails went out to almost nobody. Dashboards: the batch-scoring job ran on schedule and reported success (exit 0), wrote the expected ~4M rows, and emitted no errors — but ~85% of the 'propensity_to_churn' values written are NULL (yesterday it was <1% null). The model artifact loaded fine. Investigation shows an upstream feature table 'last_login_ts' was repartitioned yesterday and a column was retyped from timestamp to string, so a downstream cast now returns null for most rows and the model's preprocessing silently maps the failed feature to a null score. How do you triage and respond?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.