Label Delay

When the ground truth arrives too late to train your machine learning model.

The idea

To train a Machine Learning model, you need Features (the inputs, like a user's browsing history) and Labels (the actual outcome, like "Did they click the ad?"). In many systems, you know the Features immediately, but you have to wait to find out the Label. For an ad click, you might wait only 5 seconds. But what if you are predicting "Will this credit card transaction result in a chargeback?" It can take 60 days for a bank to finalize a chargeback. This gap between seeing the Features and learning the Label is called Label Delay.

Step 1: A user makes a transaction on January 1st.
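
The timeline for that January 1st transaction can be worked out with plain date arithmetic. This is a minimal sketch, not a production rule: the year 2024, the fixed 60-day delay, and the helper names label_maturity_date and is_label_mature are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumed settlement window, taken from the 60-day figure in the text.
LABEL_DELAY = timedelta(days=60)


def label_maturity_date(transaction_date: date, delay: timedelta = LABEL_DELAY) -> date:
    """Earliest date on which the transaction's label can be treated as final."""
    return transaction_date + delay


def is_label_mature(transaction_date: date, today: date, delay: timedelta = LABEL_DELAY) -> bool:
    """True once the full delay has elapsed since the transaction."""
    return today >= label_maturity_date(transaction_date, delay)


tx_date = date(2024, 1, 1)  # the year is an assumption; the text only says "January 1st"

print(label_maturity_date(tx_date))                       # 2024-03-01
print(is_label_mature(tx_date, today=date(2024, 2, 1)))   # False: only 31 days have passed
print(is_label_mature(tx_date, today=date(2024, 3, 1)))   # True: 60 days have passed
```

Until the maturity date, a missing chargeback means "not reported yet", not "legitimate".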

How it works (Observation Windows)

Because of Label Delay, you cannot use data from the last 60 days to train your model, because you don't actually know yet which of those transactions will become chargebacks. If you naively train on yesterday's data, almost none of its eventual chargebacks have been reported, so every transaction looks legitimate. The model sees a chargeback rate near 0% and learns the wrong thing. You have to enforce a strict Observation Window and only train on data older than the delay period.

from datetime import datetime, timedelta

# get_data() and model are placeholders for your own data loader and estimator.

# The naive, incorrect way (labels are immature!)
# We use all data up to today, so recent chargebacks are still unreported.
naive_training_data = get_data(end_date=datetime.today())

# The correct way (accounting for Label Delay)
# If chargebacks take 60 days to settle, we must discard
# the most recent 60 days of data from our training set.
LABEL_DELAY = timedelta(days=60)
safe_end_date = datetime.today() - LABEL_DELAY
training_data = get_data(end_date=safe_end_date)

model.fit(training_data)
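
To see why immature labels mislead the model, here is a small self-contained sketch on synthetic data. Everything in it is an assumption made for illustration: one transaction per day, every tenth transaction is a chargeback, and each chargeback is reported exactly 60 days after the transaction. Real chargebacks arrive at varying times within the window.

```python
from datetime import date, timedelta

LABEL_DELAY_DAYS = 60          # assumed settlement window
today = date(2024, 6, 1)       # arbitrary "current" date for the example

# Synthetic history: one transaction per day for the last 120 days.
# Every 10th transaction is truly a chargeback, reported 60 days later.
transactions = []
for age in range(120):
    tx_date = today - timedelta(days=age)
    is_chargeback = age % 10 == 0
    reported_on = tx_date + timedelta(days=LABEL_DELAY_DAYS) if is_chargeback else None
    transactions.append({"date": tx_date, "reported_on": reported_on})


def observed_label(tx, as_of):
    """Label as visible on `as_of`: 1 only if the chargeback has already been reported."""
    return int(tx["reported_on"] is not None and tx["reported_on"] <= as_of)


def chargeback_rate(rows, as_of):
    """Fraction of rows that look like chargebacks as of the given date."""
    return sum(observed_label(tx, as_of) for tx in rows) / len(rows)


cutoff = today - timedelta(days=LABEL_DELAY_DAYS)
recent = [tx for tx in transactions if tx["date"] > cutoff]    # labels still immature
mature = [tx for tx in transactions if tx["date"] <= cutoff]   # labels have settled

print(chargeback_rate(recent, today))        # 0.0  -> looks fraud-free, but is not
print(chargeback_rate(mature, today))        # 0.1  -> the true rate in this toy data
print(chargeback_rate(transactions, today))  # 0.05 -> naive training set is diluted
```

The true chargeback rate in this toy data is 10% throughout, yet the recent half shows 0% and drags the naive training set down to 5%.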

Cost

By dropping the most recent 60 days of data, your model is always about two months blind. If scammers invent a brand new type of credit card fraud today, your model won't even begin to see training labels for it until two months from now. Your business absorbs the losses during that entire blind spot.

Watch out for

Proxy Labels: To shrink the blind spot, companies often invent "Proxy Labels". Instead of waiting 60 days for a real bank chargeback, they train a secondary, faster model to predict "Did the user complain to customer support within 24 hours?". The proxy is noisier than a real chargeback, since not every fraud victim complains and not every complaint is fraud, but it gives the system a signal within a day to fight rapidly evolving fraud.
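
A proxy label of this kind can be derived in a few lines. The sketch below is illustrative only: the 24-hour window comes from the text, while the field layout, the function names proxy_label and proxy_is_mature, and the sample transactions are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

PROXY_WINDOW = timedelta(hours=24)  # complaint window from the text


def proxy_label(tx_time: datetime, complaint_time: Optional[datetime]) -> int:
    """1 if the user complained within 24 hours of the transaction, else 0."""
    if complaint_time is None:
        return 0
    elapsed = complaint_time - tx_time
    return int(timedelta(0) <= elapsed <= PROXY_WINDOW)


def proxy_is_mature(tx_time: datetime, now: datetime) -> bool:
    """The proxy label is final once the 24-hour window has fully elapsed."""
    return now - tx_time >= PROXY_WINDOW


now = datetime(2024, 1, 3, 12, 0)

# Hypothetical transactions: (id, transaction time, complaint time or None)
sample = [
    ("A", datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 18, 0)),  # complaint after 9 hours
    ("B", datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 3, 8, 0)),  # complaint after 46 hours
    ("C", datetime(2024, 1, 3, 6, 0), None),                         # only 6 hours old
]

# Keep only rows whose proxy window has closed, then attach the proxy label.
proxy_training_rows = [
    (tx_id, proxy_label(tx_time, complaint_time))
    for tx_id, tx_time, complaint_time in sample
    if proxy_is_mature(tx_time, now)
]

print(proxy_training_rows)  # [('A', 1), ('B', 0)]
```

The same maturity rule applies as with real chargebacks, just with a 24-hour window instead of 60 days: transaction C is excluded because its proxy label is not final yet.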