Time Series Validation

Why randomly splitting data will make you lose millions in the stock market.

The idea

In standard machine learning, you shuffle your data and randomly split it, say 80% for training and 20% for testing. If you are classifying pictures of dogs, this works great. But if you are forecasting time series data (like the stock market, or tomorrow's weather), random shuffling breaks the universe. You might train on data from December and test on data from January of the same year. The model will "predict" January's weather having already seen the future (December). To prevent this time-travel cheating, we must split chronologically. The most common scheme for doing that is Walk-Forward Validation.
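To make the time-travel problem concrete, here is a minimal sketch in plain Python that compares a shuffled split with a chronological one. The helper names (`random_split`, `chronological_split`, `leaks_future`) and the 12-month toy calendar are assumptions made for this illustration, not part of any library.

```python
import random

# Hypothetical toy calendar: one row per month, already sorted by date.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def random_split(items, test_fraction=0.2, seed=0):
    """Shuffle row positions, then cut off the first chunk as the test set."""
    positions = list(range(len(items)))
    random.Random(seed).shuffle(positions)
    n_test = int(len(items) * test_fraction)
    return sorted(positions[n_test:]), sorted(positions[:n_test])

def chronological_split(items, test_fraction=0.2):
    """Keep row order: the last chunk of rows becomes the test set."""
    n_test = int(len(items) * test_fraction)
    cut = len(items) - n_test
    return list(range(cut)), list(range(cut, len(items)))

def leaks_future(train_pos, test_pos):
    """True if any training row is later in time than some test row."""
    return max(train_pos) > min(test_pos)

train_pos, test_pos = chronological_split(months)
print([months[i] for i in test_pos])      # ['Nov', 'Dec']
print(leaks_future(train_pos, test_pos))  # False

# The shuffled split depends on the seed, so no fixed output is shown here.
train_pos, test_pos = random_split(months, seed=0)
print([months[i] for i in test_pos], leaks_future(train_pos, test_pos))
```

With two test months out of twelve, a shuffled split avoids leakage only when it happens to pick exactly Nov and Dec, so almost every seed trains on something from the future.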

Running example: we have 6 months of stock market data (Jan through Jun), sorted by date.

How it works (Walk-Forward Validation)

Instead of shuffling, we strictly respect time. We train on the past and predict the immediate future. Then we move forward: we add the block we just predicted to our training set, and predict the next block. Note that the training window grows rather than slides: it always starts at the first date and gets longer each round. The Test set must always, strictly, be chronologically after the Training set.
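The procedure just described can be sketched by hand in a few lines. This is an illustrative helper (the name `walk_forward_splits` and its block-sizing rule are assumptions for this sketch), written to behave similarly to an expanding-window splitter; it is not a replacement for a library implementation.

```python
def walk_forward_splits(n_samples, n_splits):
    """Yield (train_positions, test_positions) pairs for an expanding window.

    Assumption for this sketch: every test block has the same size,
    n_samples // (n_splits + 1), and the blocks are taken from the end.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for that many splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))                      # everything so far
        test = list(range(test_start, test_start + test_size))  # the next block
        yield train, test

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
for train, test in walk_forward_splits(len(months), n_splits=5):
    print([months[i] for i in train], "->", [months[i] for i in test])
# ['Jan'] -> ['Feb']
# ['Jan', 'Feb'] -> ['Mar']
# ['Jan', 'Feb', 'Mar'] -> ['Apr']
# ['Jan', 'Feb', 'Mar', 'Apr'] -> ['May']
# ['Jan', 'Feb', 'Mar', 'Apr', 'May'] -> ['Jun']
```

Every training list ends right before its test block begins, which is the whole point: no fold ever sees a date later than the one it is asked to predict.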

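import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Toy data (made up for illustration): one row per month, sorted strictly
# by date. Four months (Jan, Feb, Mar, Apr) are enough to show three folds.
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # feature: month number
y = np.array([10.0, 12.0, 14.0, 16.0])      # target: price that month

model = LinearRegression()  # stand-in for whatever model you actually use
tscv = TimeSeriesSplit(n_splits=3)

for train_index, test_index in tscv.split(X):
    # Iteration 1: Train on Jan. Test on Feb.
    # Iteration 2: Train on Jan+Feb. Test on Mar.
    # Iteration 3: Train on Jan+Feb+Mar. Test on Apr.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    score = mean_absolute_error(y_test, model.predict(X_test))

# Notice: We NEVER train on Mar and test on Feb.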

Cost

Time Series Validation gives you much less training data for your early folds (e.g. Iteration 1 only trains on January). This makes the first few models highly unstable, so their scores are noisy. It is also computationally expensive, because you have to retrain the model once per fold, on a training set that keeps growing as the window "walks forward" through time.
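A quick back-of-the-envelope calculation shows both costs at once. The sketch below assumes an expanding window with equal-sized test blocks and a hypothetical dataset of 120 rows (roughly six months of trading days); the function name is made up for this example.

```python
def expanding_window_train_sizes(n_samples, n_splits):
    """Training-set size of each fold, assuming equal-sized test blocks."""
    test_size = n_samples // (n_splits + 1)
    first_train = n_samples - n_splits * test_size
    return [first_train + i * test_size for i in range(n_splits)]

sizes = expanding_window_train_sizes(n_samples=120, n_splits=5)
print(sizes)       # [20, 40, 60, 80, 100]
print(sum(sizes))  # 300
```

The first fold learns from only 20 rows, which is why its score tends to be shaky, and the five retrains together process 300 training rows even though the dataset has only 120.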

Watch out for

Data Leakage via Imputation: If you have missing data (e.g. a missing stock price), a common ML trick is to fill it with the "Mean (Average) of the column". If you take the average of the entire dataset (Jan-Jun) and use it to fill a missing value in February, you just leaked future information from June into February! Always calculate the mean using only the training window.
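Here is a minimal sketch of leak-free mean imputation for a single fold, in plain Python. The prices are invented for illustration and `impute_with_train_mean` is a hypothetical helper, not a library function.

```python
from statistics import mean

# Hypothetical monthly closing prices, Jan..Jun; None marks a missing value.
prices = [100.0, None, 104.0, 110.0, 130.0, 150.0]

def impute_with_train_mean(values, train_pos, test_pos):
    """Fill gaps in both windows with a mean computed from training rows only."""
    observed = [values[i] for i in train_pos if values[i] is not None]
    if not observed:
        raise ValueError("training window has no observed values")
    fill = mean(observed)

    def filled(i):
        return fill if values[i] is None else values[i]

    return [filled(i) for i in train_pos], [filled(i) for i in test_pos], fill

train_pos, test_pos = [0, 1, 2], [3]  # train on Jan-Mar, test on Apr
train, test, fill = impute_with_train_mean(prices, train_pos, test_pos)
print(fill)   # 102.0
print(train)  # [100.0, 102.0, 104.0]

# The leaky alternative: a mean over the whole column, May and Jun included.
leaky_fill = mean(p for p in prices if p is not None)
print(leaky_fill)  # 118.8
```

The missing February price is filled with 102.0 when only Jan and Mar are visible, but with 118.8 when the summer rally leaks in. In a walk-forward loop, the fill value has to be recomputed inside every fold from that fold's training rows.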