Batch Scoring Nulls (ML Pipelines)

When a single missing value ruins predictions for an entire batch.

The idea

In Machine Learning, Batch Scoring is when you predict outcomes for thousands of users at once (e.g., a nightly job to predict "Will this user churn tomorrow?"). A common bug appears when an upstream data pipeline fails and leaves a NULL value in a feature (like last_login_date). If neither the model nor the scoring code handles NULLs, one of two things happens: the model throws an error and the entire batch job crashes, or, worse, it silently outputs garbage predictions (like a 100% churn probability) for every user with a missing feature.

Step 1: A nightly Batch Job runs 3 users through an ML model to predict churn.
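
The failure mode can be reproduced in a few lines. This is a minimal sketch, assuming a Scikit-Learn LogisticRegression as the churn model; the training data, the 3-user batch, and the batch_failed variable are hypothetical and only illustrate that a single missing value rejects the whole batch, not just one row.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training data with no missing values.
train = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63],
    "purchases": [0, 4, 9, 12, 1, 15],
})
churned = [1, 0, 0, 0, 1, 0]

model = LogisticRegression().fit(train, churned)

# Nightly batch of 3 users; the second user's age is missing.
batch = pd.DataFrame({
    "age": [30.0, np.nan, 58.0],
    "purchases": [2, 5, 11],
})

try:
    model.predict(batch)
    batch_failed = False
except ValueError as err:
    # One NaN is enough to reject the whole batch, not just that row.
    batch_failed = True
    print(f"Batch job failed: {err}")
```

Not every model behaves this way. Some estimators accept missing values natively, which is why the silent "garbage prediction" case is harder to spot than the crash.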

How it works (Imputation & Defaults)

To make Batch Scoring robust, you must implement Imputation. Before the data hits the model, the pipeline intercepts NULL values and replaces them with a safe default. This could be the Mean of the column, the Median, or a designated Missing Flag (like -999) that the model was specifically trained to understand.

# The unsafe way (many models raise an error if age is NULL)
df['prediction'] = model.predict(df[['age', 'purchases']])

# The safe way (Imputation)
# 1. Fill missing ages with the median age (e.g., 34)
# 2. Fill missing purchases with 0
# Note: this computes the median from the current batch; in production,
# reuse the median computed at training time so both stages agree.
df['age'] = df['age'].fillna(df['age'].median())
df['purchases'] = df['purchases'].fillna(0)

# Now it is safe to score the batch
df['prediction'] = model.predict(df[['age', 'purchases']])
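
A variant of the Missing Flag idea is to keep an explicit indicator column next to each imputed feature, so the model can still tell a real 0 from a filled-in 0. The sketch below is one way this might look; TRAIN_AGE_MEDIAN, impute_with_flags, and the *_missing column names are hypothetical, and the model would need to be trained with the same extra columns.

```python
import numpy as np
import pandas as pd

# Hypothetical value, computed once on the training set and stored with the model.
TRAIN_AGE_MEDIAN = 34.0

def impute_with_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NULLs and record, per row, which values were filled in."""
    out = df.copy()
    # Record which rows were missing before the NULLs are overwritten.
    out["age_missing"] = out["age"].isna().astype(int)
    out["purchases_missing"] = out["purchases"].isna().astype(int)
    out["age"] = out["age"].fillna(TRAIN_AGE_MEDIAN)
    out["purchases"] = out["purchases"].fillna(0)
    return out

batch = pd.DataFrame({
    "age": [25.0, np.nan, 41.0],
    "purchases": [3.0, 7.0, np.nan],
})
clean = impute_with_flags(batch)
print(clean)
```

Compared with a sentinel such as -999, the indicator column keeps the imputed feature in a realistic range, which matters for models that are sensitive to feature scale.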

Cost

Imputation is a band-aid. If a data pipeline breaks and 90% of your users suddenly have NULL purchases, imputing "0" for all of them lets the batch job succeed, but the predictions for those users are based on a made-up value and cannot be trusted. Your business might automatically email 90% of your users a "We miss you!" discount code, losing thousands of dollars. Imputation must be paired with strict Data Quality monitoring.
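
One simple form of Data Quality monitoring is a NULL-rate check that runs before imputation and refuses to score a batch that looks broken. This is only a sketch: the 5% limits, the MAX_NULL_RATE dictionary, and the check_null_rates helper are hypothetical, and real thresholds would come from the NULL rates seen in healthy batches.

```python
import numpy as np
import pandas as pd

# Hypothetical thresholds: maximum share of NULLs tolerated per feature.
MAX_NULL_RATE = {"age": 0.05, "purchases": 0.05}

def check_null_rates(df: pd.DataFrame, limits: dict) -> dict:
    """Return {column: null_rate} for every column above its limit."""
    rates = df[list(limits)].isna().mean()
    return {col: float(rate) for col, rate in rates.items() if rate > limits[col]}

batch = pd.DataFrame({
    "age": [25.0, 31.0, 44.0, 52.0, 38.0, 29.0, 61.0, 47.0, 33.0, 40.0],
    "purchases": [np.nan] * 9 + [4.0],  # upstream failure: 90% NULL
})

violations = check_null_rates(batch, MAX_NULL_RATE)
if violations:
    # In a real job this would raise or alert instead of scoring the batch.
    print(f"Refusing to score batch, NULL rate too high: {violations}")
```

Running the check before fillna is the important part: once the NULLs have been replaced with 0, the broken batch is indistinguishable from a healthy one.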

Watch out for

Training-Serving Skew: If you impute NULL with the Median during training but with 0 during Batch Scoring in production, the model sees inputs it was never trained on and will behave unpredictably. The imputation logic must be identical in both places (or saved as part of the model artifact itself using something like a Scikit-Learn Pipeline).
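
A Scikit-Learn Pipeline that bundles the imputers with the model might look like the sketch below. The training data and the choice of LogisticRegression are assumptions for illustration; SimpleImputer, ColumnTransformer, and Pipeline are standard Scikit-Learn classes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data that already contains some NULLs.
train = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 51.0, 29.0, 63.0],
    "purchases": [0.0, 4.0, 9.0, np.nan, 1.0, 15.0],
})
churned = [1, 0, 0, 0, 1, 0]

# Median for age, constant 0 for purchases: the same rules as the snippet above.
imputer = ColumnTransformer([
    ("age", SimpleImputer(strategy="median"), ["age"]),
    ("purchases", SimpleImputer(strategy="constant", fill_value=0), ["purchases"]),
])
pipeline = Pipeline([("impute", imputer), ("model", LogisticRegression())])
pipeline.fit(train, churned)

# At scoring time the raw batch goes straight in; the pipeline reuses the
# median it learned during fit instead of recomputing it from the batch.
batch = pd.DataFrame({"age": [np.nan, 40.0], "purchases": [3.0, np.nan]})
predictions = pipeline.predict(batch)
learned_median = pipeline.named_steps["impute"].named_transformers_["age"].statistics_[0]
print(predictions, learned_median)
```

Because the imputers are fitted steps of the pipeline, saving the pipeline as one artifact (for example with joblib.dump) ships the training-time median together with the model, so the batch job has no separate imputation code that could drift.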