Training Job OOM (Out of Memory)

Why you can't just load a 100GB dataset into 16GB of RAM.

The idea

When training a machine learning model in Python, the easiest approach is to read the entire CSV file into memory with Pandas and pass the resulting DataFrame to model.fit(). This works fine for small datasets. But if your company has 100GB of historical data and your server has only 16GB of RAM, the load fails partway through: Python raises a MemoryError, or the operating system kills the process with an OOM (Out of Memory) error. The dataset simply does not fit in RAM all at once.
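For tabular data, one common way around this is to let Pandas read the file in pieces. The sketch below is only an illustration: the file name, the "price" column, and the chunked_mean helper are hypothetical, and a tiny temporary CSV stands in for the 100GB dataset. It relies on the chunksize parameter of pandas.read_csv, which makes the call return an iterator of DataFrames instead of one large DataFrame.

```python
import os
import tempfile

import pandas as pd


def chunked_mean(csv_path, column, chunksize=1000):
    """Compute the mean of one column without loading the whole CSV."""
    total = 0.0
    count = 0
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows
    for chunk in pd.read_csv(csv_path, usecols=[column], chunksize=chunksize):
        total += chunk[column].sum()
        count += len(chunk)
    return total / count


# Tiny hypothetical file standing in for the real dataset
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "history.csv")
pd.DataFrame({"price": range(10_000)}).to_csv(demo_path, index=False)

demo_mean = chunked_mean(demo_path, "price")
print(demo_mean)
```

Only one chunk of rows is in memory at any moment, so peak usage depends on chunksize rather than on the size of the file.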

Step 1: The database contains 100GB of image data. The GPU only has 16GB of VRAM.

How it works (Data Generators & Batching)

To fix OOM crashes, we stop loading the whole dataset at once. Instead, we use data generators (or streams). A generator reads a small batch of, say, 32 images from disk into RAM, hands them to the GPU for one training step, and then releases them so the memory can be reused for the next 32 images. Neither RAM nor the GPU ever holds the whole dataset at once.
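The batching idea itself does not depend on any ML framework. Here is a minimal sketch in plain Python, assuming a hypothetical fake_image_stream that stands in for reading image files from disk; the batches helper is likewise illustrative and not part of any library.

```python
from itertools import islice


def batches(items, batch_size):
    """Lazily yield lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_image_stream(n):
    # Hypothetical stand-in for reading image files from disk one by one
    for i in range(n):
        yield f"image_{i}"


seen = 0
largest = 0
for batch in batches(fake_image_stream(100), 32):
    # A real training loop would pass `batch` to the model here
    seen += len(batch)
    largest = max(largest, len(batch))

print(seen, largest)
```

Because both functions are generators, an item is produced only when the loop asks for it, and each batch becomes garbage once the loop moves on to the next one.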

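import tensorflow as tf

# BAD: Loads 1,000,000 images into RAM all at once. Crash!
# X = load_all_images("data/")
# model.fit(X, y)

# GOOD: Build a tf.data pipeline.
# File names are listed up front, but pixels are read from disk lazily.
dataset = tf.data.Dataset.list_files("data/*.jpg")

def process_path(file_path):
    # Load and decode a single image
    image = tf.io.decode_jpeg(tf.io.read_file(file_path), channels=3)
    # Resize so every image has the same shape; batch() fails on mixed sizes.
    # 224x224 is an assumed size; use whatever your model expects.
    return tf.image.resize(image, [224, 224])

dataset = dataset.map(process_path)

# Group into batches of 32, so only a few batches sit in RAM at a time.
dataset = dataset.batch(32)

# The model pulls batches one by one during training.
# Assumes `model` is already built and compiled. For supervised training,
# process_path must also return a label, i.e. (image, label) pairs.
model.fit(dataset, epochs=10)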

Cost

Using a data generator solves the RAM limit, but it can create an I/O (Input/Output) bottleneck. Reading from disk is orders of magnitude slower than reading from RAM, especially on a spinning hard drive. If loading and decoding the next 32 images takes longer than one training step, your expensive GPU sits idle for most of each step, waiting for the CPU. You trade a memory crash for a potentially severe speed penalty.
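How much time is actually lost depends on your hardware and data, so it is worth measuring rather than guessing. The sketch below is one rough way to do that, under stated assumptions: measure_wait_fraction, slow_loader, and fast_step are hypothetical names, and time.sleep stands in for both disk reads and the training step.

```python
import time


def measure_wait_fraction(batch_iter, train_step):
    """Return the fraction of wall-clock time spent waiting for data."""
    wait = 0.0
    compute = 0.0
    iterator = iter(batch_iter)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is data loading
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # time spent here is training
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total else 0.0


def slow_loader(n_batches, load_seconds):
    # Hypothetical loader: sleep stands in for slow disk reads
    for i in range(n_batches):
        time.sleep(load_seconds)
        yield [i] * 32


def fast_step(batch):
    # Hypothetical training step that is much faster than loading
    time.sleep(0.001)


fraction = measure_wait_fraction(slow_loader(5, 0.05), fast_step)
print(f"waiting on data for {fraction:.0%} of the time")
```

A fraction close to 1 means the loop is I/O bound. With a real framework that dispatches GPU work asynchronously, train_step would need to block until the step has finished for the split between waiting and computing to be meaningful.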

Watch out for