Point-in-Time Leakage

When your ML model accidentally memorizes the future.

The idea

Machine Learning models are trained to predict the future based on the past. But when you build your training dataset in a data warehouse (like Snowflake), it is incredibly easy to accidentally run a SQL JOIN that attaches a piece of data from after the prediction event happened. This is called Data Leakage. The model learns this "cheat code" and gets 99% accuracy in training, but completely fails in real production because that future data doesn't exist yet.

Step 1: A user browses the site on Tuesday and makes a purchase on Wednesday.

How it works (Point-in-Time Joins)

To prevent leakage, you must use an AS OF join (or Point-in-Time join). When joining your "Events" table to your "User Profile" table, you can't just join on user_id. You must join on user_id AND ensure the profile data timestamp is strictly less than the event timestamp.

-- BAD: Data Leakage (Standard Join)
-- Attaches the CURRENT total_purchases to a PAST event.
SELECT e.event_date, e.user_id, p.total_purchases, e.will_buy
FROM events e
JOIN profiles p ON e.user_id = p.user_id;

-- GOOD: Point-in-Time Join (AS OF)
-- Only attaches the total_purchases AS THEY WERE on the event date.
SELECT e.event_date, e.user_id, p.total_purchases, e.will_buy
FROM events e
ASOF JOIN profiles p 
  MATCH_CONDITION (e.event_date >= p.updated_at)
  ON e.user_id = p.user_id;

Cost

Performing AS OF joins is computationally expensive. Standard databases use Hash Joins which are fast. Point-in-Time joins often require sorting the massive events and profiles tables by timestamp and doing a rolling merge-join. This can make your offline training pipeline run 10x slower and cost significantly more in cloud compute.

Watch out for