Question
Your nightly batch-scoring job writes a 'propensity_to_churn' score for every active account into a table the CRM reads to trigger retention campaigns. This morning the marketing ops team reports retention emails went out to almost nobody. Dashboards: the batch-scoring job ran on schedule and reported success (exit 0), wrote the expected ~4M rows, and emitted no errors — but ~85% of the 'propensity_to_churn' values written are NULL (yesterday it was <1% null). The model artifact loaded fine. Investigation shows an upstream feature table 'last_login_ts' was repartitioned yesterday and a column was retyped from timestamp to string, so a downstream cast now returns null for most rows and the model's preprocessing silently maps the failed feature to a null score. How do you triage and respond?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.