When a single missing value ruins predictions for an entire batch.
In Machine Learning, Batch Scoring is when you predict outcomes for thousands of users at once (e.g., a nightly job to predict "Will this user churn tomorrow?"). A common bug happens when a data pipeline fails upstream, resulting in a NULL value for a feature (like last_login_date). If your ML model isn't trained to handle NULLs gracefully, it might throw an error and crash the entire batch job, or worse, silently output garbage predictions (like predicting 100% churn probability) for everyone who had a missing feature.
To make Batch Scoring robust, you must implement Imputation. Before the data hits the model, the pipeline intercepts NULL values and replaces them with a safe default. This could be the Mean or the Median of the column (computed on the training data), or a designated Missing Flag (like -999) that the model was specifically trained to understand.
# The unsafe way (Crashes if age is NULL)
df['prediction'] = model.predict(df[['age', 'purchases']])
# The safe way (Imputation)
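# 1. Fill missing ages with the median age (e.g., 34).
#    Here the median comes from the current batch; in production, reuse the value computed at training time.
df['age'] = df['age'].fillna(df['age'].median())
# 2. Fill missing purchases with 0
df['purchases'] = df['purchases'].fillna(0)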
# Now it is safe to score the batch
df['prediction'] = model.predict(df[['age', 'purchases']])
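The Missing Flag option can be sketched in the same style. This is a minimal illustration, not a prescribed method: the helper name flag_missing, the constant MISSING_SENTINEL, and the extra age_missing indicator column are assumptions made for the example, and the approach only works if the model was trained on data prepared the same way.

```python
import numpy as np
import pandas as pd

# Assumed sentinel; it must be the same value the model saw during training.
MISSING_SENTINEL = -999


def flag_missing(df, columns, sentinel=MISSING_SENTINEL):
    """Return a copy where NULLs become the sentinel, plus a 0/1 indicator per column."""
    out = df.copy()
    for col in columns:
        # Record which rows were missing before overwriting them.
        out[f"{col}_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(sentinel)
    return out


# Toy batch (made up): the second user has no age.
batch = pd.DataFrame({"age": [34.0, np.nan], "purchases": [2.0, 5.0]})
flagged = flag_missing(batch, ["age"])
print(flagged)
```

A sentinel far outside the real value range tends to suit tree-based models better than linear ones, where -999 would be treated as a genuine, very low age.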
Imputation is a band-aid. If a data pipeline breaks and 90% of your users suddenly have NULL purchases, imputing "0" for all of them will allow the batch job to succeed, but the model's predictions will be completely inaccurate. Your business might automatically email 90% of your users a "We miss you!" discount code, losing thousands of dollars. Imputation must be paired with strict Data Quality monitoring.
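One way to pair imputation with a Data Quality check is to measure the NULL rate of each feature before filling anything, and refuse to score when it is abnormally high. The sketch below is illustrative only: the thresholds in MAX_NULL_RATE and the function names check_null_rates and guard_batch are hypothetical, and real limits should come from the NULL rates you normally observe.

```python
import pandas as pd

# Hypothetical thresholds: the maximum fraction of NULLs tolerated per feature.
MAX_NULL_RATE = {"age": 0.05, "purchases": 0.10}


def check_null_rates(df, limits):
    """Return {column: observed null rate} for every column above its limit."""
    observed = df[list(limits)].isna().mean()
    return {
        col: float(rate)
        for col, rate in observed.items()
        if rate > limits[col]
    }


def guard_batch(df, limits):
    """Raise before scoring if any feature has too many missing values."""
    violations = check_null_rates(df, limits)
    if violations:
        raise ValueError(f"Null rate above threshold, refusing to score: {violations}")


# Toy batch (made up): 3 of 4 users have NULL purchases,
# which points at a broken upstream job rather than real behavior.
batch = pd.DataFrame(
    {"age": [34.0, 41.0, 29.0, 52.0], "purchases": [2.0, None, None, None]}
)

try:
    guard_batch(batch, MAX_NULL_RATE)
except ValueError as err:
    print(err)
```

Run guard_batch before the fillna calls: once the NULLs are imputed, the evidence of the upstream failure is gone.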
If you impute a NULL with the Median during training, but you impute it with 0 during Batch Scoring in production, your model will behave unpredictably. The imputation logic must be perfectly synced, or saved as part of the model artifact itself using something like a Scikit-Learn Pipeline.
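A minimal sketch of bundling the imputation rules with the model, assuming scikit-learn and joblib are available. The toy training data, the choice of LogisticRegression, and the file name churn_pipeline.joblib are placeholders for illustration, not part of any real pipeline.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data (made up for illustration).
train = pd.DataFrame(
    {
        "age": [25.0, 34.0, np.nan, 51.0, 42.0, 29.0],
        "purchases": [0.0, 3.0, 1.0, np.nan, 7.0, 0.0],
    }
)
churned = [1, 0, 1, 0, 0, 1]

# The imputation rules live inside the pipeline, so they are fitted once
# on the training data and replayed unchanged at scoring time.
impute = ColumnTransformer(
    [
        ("age_median", SimpleImputer(strategy="median"), ["age"]),
        ("purchases_zero", SimpleImputer(strategy="constant", fill_value=0.0), ["purchases"]),
    ]
)
pipeline = Pipeline([("impute", impute), ("model", LogisticRegression())])
pipeline.fit(train, churned)

# Persist imputer and model together as a single artifact (placeholder path).
artifact_path = os.path.join(tempfile.gettempdir(), "churn_pipeline.joblib")
joblib.dump(pipeline, artifact_path)

# In the batch job: load the artifact and score raw data, NULLs included.
scorer = joblib.load(artifact_path)
batch = pd.DataFrame({"age": [np.nan, 38.0], "purchases": [2.0, np.nan]})
batch["prediction"] = scorer.predict(batch[["age", "purchases"]])

# The median learned from the training data travels with the artifact.
learned_median = scorer.named_steps["impute"].named_transformers_["age_median"].statistics_[0]
print(learned_median)
```

Because the batch job only ever calls predict on the loaded artifact, there is no second copy of the fill values to drift out of sync with training.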