When your model cheats on the final exam by looking at the answer key.
Imagine training an AI to predict if a patient has pneumonia from their medical records. During training, the model achieves 99.9% accuracy. You deploy it to the hospital, and it fails completely. Why? In your training data, you accidentally included a column called "given_antibiotics". The model learned a simple rule: "If they were given antibiotics, predict Pneumonia." This is Target Leakage. The model used a feature that was created as a result of the pneumonia. In the real world, when a new patient walks in the door, they haven't been given antibiotics yet, so the model has no idea what to do.
Target Leakage happens when your training data includes information that will not be available at the exact moment you need to make the prediction in production. To prevent it, you must strictly enforce a timeline. If you are predicting an event at Time T, you must aggressively delete any column from your dataset that was recorded after Time T.
import pandas as pd
df = pd.read_csv("hospital_records.csv")
# We want to predict 'has_pneumonia' when they walk in the door.
target = df['has_pneumonia']
# BAD: These features are created AFTER the doctor diagnoses them!
# The model will 'cheat' by using these clues.
leaky_features = ['given_antibiotics', 'days_in_icu', 'billing_code_482']
# GOOD: Drop the leaky columns before training
clean_df = df.drop(columns=leaky_features)
# Now the model is forced to learn from legit symptoms
X = clean_df[['fever', 'cough', 'age']]
model.fit(X, target)
Target leakage is the single most common reason why Machine Learning projects fail in production. It wastes months of engineering time because the model looks incredibly successful during testing. The only "cost" to fixing it is that your model's accuracy will drop from a fake 99% to a realistic 85%. You must accept the lower, honest score.
"late_fees_paid". A user only pays late fees after they start defaulting. You must drop that column.