Target Leakage

When your model cheats on the final exam by looking at the answer key.

The idea

Imagine training an AI to predict if a patient has pneumonia from their medical records. During training, the model achieves 99.9% accuracy. You deploy it to the hospital, and it fails completely. Why? In your training data, you accidentally included a column called "given_antibiotics". The model learned a simple rule: "If they were given antibiotics, predict Pneumonia." This is Target Leakage. The model used a feature that was created as a result of the pneumonia. In the real world, when a new patient walks in the door, they haven't been given antibiotics yet, so the model has no idea what to do.

Step 1: The Data. We want to predict if a patient has Pneumonia (Target).

How it works (The Timeline Rule)

Target Leakage happens when your training data includes information that will not be available at the exact moment you need to make the prediction in production. To prevent it, you must strictly enforce a timeline. If you are predicting an event at Time T, you must aggressively delete any column from your dataset that was recorded after Time T.

import pandas as pd

df = pd.read_csv("hospital_records.csv")

# We want to predict 'has_pneumonia' when they walk in the door.
target = df['has_pneumonia']

# BAD: These features are created AFTER the doctor diagnoses them!
# The model will 'cheat' by using these clues.
leaky_features = ['given_antibiotics', 'days_in_icu', 'billing_code_482']

# GOOD: Drop the leaky columns before training
clean_df = df.drop(columns=leaky_features)

# Now the model is forced to learn from legit symptoms
X = clean_df[['fever', 'cough', 'age']]
model.fit(X, target)

Cost

Target leakage is the single most common reason why Machine Learning projects fail in production. It wastes months of engineering time because the model looks incredibly successful during testing. The only "cost" to fixing it is that your model's accuracy will drop from a fake 99% to a realistic 85%. You must accept the lower, honest score.

Watch out for

Subtle Leakage: Leakage isn't always obvious. If you are predicting who will default on a loan, you might accidentally include "late_fees_paid". A user only pays late fees after they start defaulting. You must drop that column.