k-Fold Cross-Validation

How to ensure your model didn't just get lucky on the test set.

The idea

When training an ML model, you usually split your data into a Training set (e.g., 80%) and a Test set (e.g., 20%). But what if, entirely by chance, the Test set happens to contain all the easiest examples? Your model might report 99% accuracy, yet perform far worse in the real world.

To guard against this, we use k-Fold Cross-Validation. Instead of splitting the data once, we split it into k roughly equal chunks (e.g., 5 folds). We train the model 5 separate times, each time using a different chunk as the Test set and the other 4 chunks as the Training set. We then average the 5 scores to get an accuracy estimate that depends far less on one lucky (or unlucky) split. The spread of the 5 scores also shows how much the result varies from split to split.

Step 1: The data is randomly shuffled and divided into k=5 equal "folds" (chunks).
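To make the split concrete, here is a minimal sketch that prints which rows land in each fold. The 10-row toy array and the random_state value are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset (hypothetical): 10 samples, one feature each.
X = np.arange(10).reshape(-1, 1)

# random_state=0 is an arbitrary choice that makes the shuffle repeatable.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    test_folds.append(test_idx)
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Every sample appears in exactly one test fold.
all_test = np.sort(np.concatenate(test_folds))
print(all_test.tolist() == list(range(10)))
```

With 10 samples and k=5, each fold holds out 2 rows for testing and trains on the remaining 8.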

How it works (The k-Fold loop)

Cross-validation is a diagnostic tool. You don't deploy the 5 models you train during this process. You use this process to confidently measure how good your model architecture and hyperparameters are. Once you are satisfied with the averaged score, you throw away the 5 test models and train one final model on 100% of the data.

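import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

# Assumes X and y are NumPy arrays (features and labels) and that
# MyAwesomeModel is a placeholder for any estimator with fit() and predict().

# 1. Prepare 5 splits (fixing random_state makes the folds reproducible)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

# 2. Train 5 separate times
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = MyAwesomeModel()              # Fresh, untrained model for every fold
    model.fit(X_train, y_train)           # Train on 4 folds
    predictions = model.predict(X_test)   # Test on 1 fold

    score = accuracy_score(y_test, predictions)
    scores.append(score)

# 3. Average the results (the spread shows how much the folds disagree)
print(f"Average Accuracy: {np.mean(scores):.3f} (std: {np.std(scores):.3f})")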
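The same workflow, including the final "train on 100% of the data" step described earlier, can be sketched with scikit-learn's cross_val_score helper. The synthetic dataset and the LogisticRegression estimator below are stand-ins chosen for illustration; substitute your own data and model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: replace with your own X and y).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Stand-in estimator (assumption: any scikit-learn classifier works here).
model = LogisticRegression(max_iter=1000)

# One call runs the whole loop: it clones the model, fits it on each
# training split, and returns one accuracy score per fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f} (std: {scores.std():.3f})")

# The per-fold models are discarded. The final model is trained on all the data.
final_model = LogisticRegression(max_iter=1000).fit(X, y)
```

Note that final_model has no held-out score of its own; the cross-validated average serves as the estimate of how it is expected to perform.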

Cost

Cross-validation costs roughly k times a single training run, because the model is trained k times (each time on (k-1)/k of the data). If training your neural network takes 1 hour, a 5-fold cross-validation will take up to about 5 hours. Because of this cost, deep learning models trained on very large datasets rarely use k-Fold. It is most commonly used on smaller tabular datasets, with models such as Random Forests or XGBoost, where a single training run often takes only seconds.
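As a back-of-the-envelope sketch of that arithmetic: the helper below is hypothetical and assumes training time grows linearly with the number of rows, which is often not true in practice. Under that assumption each fold's fit is slightly cheaper than a full fit, since it sees only (k-1)/k of the data, so the "k times" figure is closer to an upper bound.

```python
def estimated_cv_hours(full_fit_hours: float, k: int, include_final_fit: bool = True) -> float:
    """Rough cost estimate, ASSUMING training time is linear in the number of rows."""
    # k fits, each on (k - 1) / k of the rows.
    cv_hours = k * full_fit_hours * (k - 1) / k
    # Optionally add the final model trained on 100% of the data.
    return cv_hours + full_fit_hours if include_final_fit else cv_hours


# 1-hour model, 5 folds: cross-validation alone, then including the final fit.
print(estimated_cv_hours(1.0, 5, include_final_fit=False))
print(estimated_cv_hours(1.0, 5))
```

Measuring one real fold on your own hardware will give a far better estimate than any formula like this one.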

Watch out for