Flagging Bad Inputs (ML)

Teaching your ML model to explicitly recognize when data is missing or broken.

The idea

When a value is missing (e.g., a user's age is NULL), the standard practice is to impute it, typically by filling in the column's mean or median (e.g., age 34). But the fact that the value is missing can itself be a highly predictive signal. For example, users who decline to provide their age might be 3x more likely to churn than users who do. If you simply replace their NULL with 34, the model can no longer tell them apart from genuine 34-year-olds, and that signal is lost. Flagging bad inputs means creating a new boolean column (e.g., age_is_missing) and feeding it to the model alongside the imputed value.
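A quick way to check whether missingness carries signal is to compare the target rate for rows with and without the value. The sketch below uses a made-up toy frame; the column names (age, churned) and the numbers are illustrative assumptions, not real data.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "age": [25, None, 41, None, 33, 52, None, 29],
    "churned": [0, 1, 0, 1, 0, 1, 0, 0],
})

# Label each row by whether age was provided, then compare churn rates.
age_status = df["age"].isna().map({True: "missing", False: "present"})
churn_by_status = df.groupby(age_status)["churned"].mean()

print(churn_by_status)
```

In this toy frame the "missing" group churns far more often than the "present" group, which is exactly the pattern that a plain fill with 34 would hide from the model.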

How it works (Indicator Variables)

In pandas or SQL, you create an "indicator variable": a column that is 1 if the original value was missing or broken, and 0 otherwise. A tree-based model (such as XGBoost) can then split on the indicator and learn a different set of rules for users with missing data than for users who genuinely happen to be 34 years old.

import pandas as pd

# Small example frame; in practice df is your own dataset
df = pd.DataFrame({'age': [32, None, 36]})

# 1. Create the explicit flag (indicator variable) BEFORE imputing
df['age_is_missing'] = df['age'].isna().astype(int)

# 2. Impute the original column so models that cannot handle NaN still run
median_age = df['age'].median()
df['age'] = df['age'].fillna(median_age)

# The model now receives BOTH columns:
# User A: age=32, age_is_missing=0
# User B: age=34, age_is_missing=1  <-- Model knows it was imputed!
Cost

Creating an indicator variable for every feature doubles the number of columns (the width) of your dataset. That increases memory use and compute during training and inference, by up to 2x if the flags are stored at the same width as the original features. You usually only want flags for features where missingness is plausibly meaningful, for example because it reflects user behavior or how the data was collected.
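One way to keep the cost down is to flag only selected columns and store the flags in a narrow integer type. The sketch below uses a missing-rate threshold as a simple stand-in for "missingness is relevant"; the helper name add_selected_flags, the min_missing_rate value, and the toy columns are assumptions for illustration, and a real selection would also draw on domain knowledge.

```python
import numpy as np
import pandas as pd

def add_selected_flags(frame, columns):
    """Return a copy with an int8 `<col>_is_missing` flag for each listed column."""
    out = frame.copy()
    for col in columns:
        out[f"{col}_is_missing"] = out[col].isna().astype("int8")
    return out

df = pd.DataFrame({
    "age": [32.0, np.nan, 36.0, np.nan],
    "income": [50_000.0, 62_000.0, np.nan, 48_000.0],
    "tenure_months": [12.0, 3.0, 24.0, 7.0],
})

# Flag only columns whose share of missing values reaches the threshold.
min_missing_rate = 0.05  # hypothetical cutoff
missing_rate = df.isna().mean()
candidates = missing_rate[missing_rate >= min_missing_rate].index.tolist()

flagged = add_selected_flags(df, candidates)

# An int8 flag takes 1 byte per row, versus 8 bytes for a float64 feature.
print(flagged.dtypes)
print(df["age"].nbytes, flagged["age_is_missing"].nbytes)
```

Here tenure_months has no missing values, so it gets no flag, and the two flags that are created add far less memory than two extra float columns would.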

Watch out for

Out-of-bound defaults: An older, simpler technique is to fill NULLs with an impossible value, such as -999. This is workable for tree-based models, which can isolate the sentinel with a single split, but it badly distorts linear regression and neural networks: they multiply a learned weight by -999 as if it were a real magnitude, which skews the fit for every row. Indicator variables are much safer across model architectures.
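The effect on a linear model can be seen with a small least-squares fit. This is a sketch on synthetic numbers chosen for illustration (the true slope is 2, and two rows have an unknown x); it uses NumPy's lstsq rather than any particular ML library.

```python
import numpy as np

# Synthetic data: for observed rows, y = 2 * x exactly.
x_observed = np.array([20.0, 30.0, 40.0, 50.0])
y_observed = 2.0 * x_observed
y_missing = np.array([90.0, 90.0])  # targets for two rows whose x is unknown
y = np.concatenate([y_observed, y_missing])
ones = np.ones_like(y)

# Option A: fill the unknown x with the sentinel -999 and fit y = a*x + b.
x_sentinel = np.concatenate([x_observed, [-999.0, -999.0]])
A = np.column_stack([x_sentinel, ones])
coef_sentinel, *_ = np.linalg.lstsq(A, y, rcond=None)
slope_sentinel = coef_sentinel[0]

# Option B: impute the median and add an indicator; fit y = a*x + c*flag + b.
x_imputed = np.concatenate([x_observed, np.full(2, np.median(x_observed))])
flag = np.concatenate([np.zeros(4), np.ones(2)])
B = np.column_stack([x_imputed, flag, ones])
coef_flagged, *_ = np.linalg.lstsq(B, y, rcond=None)
slope_flagged = coef_flagged[0]

print("slope with -999 sentinel:", slope_sentinel)
print("slope with median + flag:", slope_flagged)
```

With the sentinel, the two far-away points dominate the fit and the slope collapses toward zero. With the indicator, the slope on the observed rows is recovered and the flag's coefficient absorbs the offset of the missing-value rows.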