When your ML model accidentally memorizes the future.
Machine Learning models are trained to predict the future based on the past. But when you build your training dataset in a data warehouse (like Snowflake), it is incredibly easy to accidentally run a SQL JOIN that attaches a piece of data from after the prediction event happened. This is called Data Leakage. The model learns this "cheat code" and gets 99% accuracy in training, but completely fails in real production because that future data doesn't exist yet.
To prevent leakage, you must use an AS OF join (or Point-in-Time join). When joining your "Events" table to your "User Profile" table, you can't just join on user_id. You must join on user_id AND ensure the profile data timestamp is strictly less than the event timestamp.
-- BAD: Data Leakage (Standard Join)
-- Attaches the CURRENT total_purchases to a PAST event.
SELECT e.event_date, e.user_id, p.total_purchases, e.will_buy
FROM events e
JOIN profiles p ON e.user_id = p.user_id;
-- GOOD: Point-in-Time Join (AS OF)
-- Only attaches the total_purchases AS THEY WERE on the event date.
SELECT e.event_date, e.user_id, p.total_purchases, e.will_buy
FROM events e
ASOF JOIN profiles p
MATCH_CONDITION (e.event_date >= p.updated_at)
ON e.user_id = p.user_id;
Performing AS OF joins is computationally expensive. Standard databases use Hash Joins which are fast. Point-in-Time joins often require sorting the massive events and profiles tables by timestamp and doing a rolling merge-join. This can make your offline training pipeline run 10x slower and cost significantly more in cloud compute.