Inference GPU OOM (Out of Memory)

Why a single long document can crash an entire AI server.

The idea

GPU memory (VRAM) is a hard, fixed budget (e.g., 24GB or 80GB per card). A large portion of it is occupied by the Model Weights for as long as the server runs. The remaining space is used for the "KV Cache", which stores the attention keys and values for every token in each active request. The longer a user's prompt (e.g., a pasted 50-page PDF), the more KV Cache memory it consumes. If 10 users all paste massive PDFs at the same time, the GPU runs out of memory, the next allocation fails with CUDA_ERROR_OUT_OF_MEMORY (OOM), and an unprotected inference server crashes, taking every in-flight request down with it.
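To make the growth concrete, the per-token KV Cache cost can be estimated from the model's shape: every layer stores one key and one value vector per KV head for each token. The sketch below assumes a hypothetical model with 32 layers, 32 KV heads, a head dimension of 128, and 16-bit cache entries. The `kvCacheBytes` name and the figure of 25,000 tokens for a 50-page PDF are illustrative assumptions, not measurements.

```javascript
// Hypothetical model shape (assumed for illustration only).
const model = {
    numLayers: 32,
    numKvHeads: 32,
    headDim: 128,
    bytesPerValue: 2, // 16-bit (fp16/bf16) cache entries
};

// Bytes of KV Cache needed to hold `numTokens` tokens of context.
// The leading 2 accounts for storing both a key and a value per token.
function kvCacheBytes(numTokens, m = model) {
    const bytesPerToken = 2 * m.numLayers * m.numKvHeads * m.headDim * m.bytesPerValue;
    return numTokens * bytesPerToken;
}

const GIB = 1024 ** 3;
const pdfTokens = 25000; // assumed token count for a 50-page PDF

console.log(kvCacheBytes(1));                                  // 524288 bytes (0.5 MiB) per token
console.log((kvCacheBytes(pdfTokens) / GIB).toFixed(1));       // "12.2" GiB for one PDF
console.log(((10 * kvCacheBytes(pdfTokens)) / GIB).toFixed(1)); // "122.1" GiB for ten PDFs
```

Under these assumptions, ten such documents would need far more memory than any single 80GB card has left after loading the weights.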

Step 1: Normal. On an 80GB GPU, the model weights take 40GB. That leaves 40GB free for user prompts (KV Cache).
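The budget in this step can be turned into a rough token capacity. The sketch below assumes an 80GB-class GPU, treats GB as GiB for simplicity, and uses a hypothetical KV Cache cost of 0.5 MiB per token. Real servers also need headroom for activations and fragmentation, so the usable figure would be lower.

```javascript
const GIB = 1024 ** 3;
const MIB = 1024 ** 2;

const totalVram = 80 * GIB;      // assumed 80GB-class GPU
const weights = 40 * GIB;        // model weights, resident while the server runs
const bytesPerToken = 0.5 * MIB; // assumed KV Cache cost per token (hypothetical)

// Total tokens of context the free memory can hold across all active requests.
function tokenCapacity(total, used, perToken) {
    return Math.floor((total - used) / perToken);
}

const capacity = tokenCapacity(totalVram, weights, bytesPerToken);
console.log(capacity);                     // 81920 tokens shared by every active request
console.log(Math.floor(capacity / 25000)); // 3 concurrent requests of 25,000 tokens each
```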

How it works (Continuous Batching & Max Tokens)

To prevent OOM crashes, an inference server must act like an aggressive nightclub bouncer. First, it must strictly limit both the prompt length and the max_tokens a single request can ask for. More importantly, it applies admission control together with Continuous Batching. Instead of blindly accepting all requests, the server maintains an internal queue and estimates how much VRAM a new request will require. If the GPU is full, new requests wait in the queue. With Continuous Batching, the scheduler revisits the batch at every generation step, so as soon as an existing request finishes and its memory is freed, a waiting request can take its place.
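The queueing behavior described above can be sketched as a small scheduler loop. This is a simplified model, not how any particular server is implemented: memory is counted in tokens rather than bytes, each request reserves its prompt plus its maximum output up front, and `Scheduler`, `submit`, and `step` are hypothetical names.

```javascript
class Scheduler {
    constructor(capacityTokens) {
        this.capacityTokens = capacityTokens;
        this.freeTokens = capacityTokens; // KV Cache budget, counted in tokens
        this.waiting = [];                // FIFO queue of requests not yet admitted
        this.running = [];                // requests currently in the batch
    }

    submit(id, promptTokens, maxNewTokens) {
        const reserved = promptTokens + maxNewTokens;
        // A request that can never fit would block the queue forever, so reject it.
        if (reserved > this.capacityTokens) {
            throw new RangeError(`request ${id} exceeds total KV Cache capacity`);
        }
        this.waiting.push({ id, reserved, remaining: maxNewTokens });
    }

    // One decoding iteration: admit what fits, generate one token per running
    // request, then release the memory of requests that finished.
    step() {
        while (this.waiting.length > 0 && this.waiting[0].reserved <= this.freeTokens) {
            const req = this.waiting.shift();
            this.freeTokens -= req.reserved;
            this.running.push(req);
        }
        for (const req of this.running) req.remaining -= 1;
        const finished = this.running.filter((r) => r.remaining <= 0);
        this.running = this.running.filter((r) => r.remaining > 0);
        for (const req of finished) this.freeTokens += req.reserved;
        return finished.map((r) => r.id);
    }
}

const s = new Scheduler(100);
s.submit("a", 50, 2);  // reserves 52 tokens
s.submit("b", 60, 1);  // reserves 61 tokens, does not fit alongside "a"
console.log(s.step()); // [] : "a" is running, "b" waits in the queue
console.log(s.step()); // [ 'a' ] : "a" finishes and frees its 52 tokens
console.log(s.step()); // [ 'b' ] : "b" is admitted and finishes in one step
```

Admitting strictly in arrival order keeps the sketch short, but it means a large request at the head of the queue can hold back smaller ones behind it.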

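// Pseudocode of admission control in a safe inference server.
// Real servers such as vLLM are more sophisticated; this shows only the core idea.
// tokenize, estimateKVCacheSize, gpu, requestQueue, and allocateAndRun are placeholders.

function onNewRequest(promptText, maxNewTokens) {
    const numPromptTokens = tokenize(promptText).length;

    // 1. Strict limit guardrail: prompt plus requested output must fit the context window
    if (numPromptTokens + maxNewTokens > MAX_CONTEXT_LENGTH) {
        return "Error: Payload Too Large";
    }

    // 2. Memory estimation: reserve for the worst case (prompt + full generation)
    const requiredMemory = estimateKVCacheSize(numPromptTokens + maxNewTokens);
    const request = { promptText, maxNewTokens, requiredMemory };

    // 3. Admission Control (Don't let them in if we're full!)
    if (gpu.freeMemory < requiredMemory) {
        requestQueue.push(request); // Wait your turn
    } else {
        gpu.freeMemory -= requiredMemory; // Reserve before running
        allocateAndRun(request);
    }
}

// 4. When a request finishes, free its memory and admit queued requests that now fit
function onRequestFinished(request) {
    gpu.freeMemory += request.requiredMemory;
    while (requestQueue.length > 0 && requestQueue[0].requiredMemory <= gpu.freeMemory) {
        const next = requestQueue.shift();
        gpu.freeMemory -= next.requiredMemory;
        allocateAndRun(next);
    }
}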

Cost

Using a strict queue prevents OOM crashes, but it creates latency spikes. If the GPU is busy processing a massive document, new users in the queue might wait 10 seconds or more before the model emits its first token. Monitor your "Queue Depth" metric closely and autoscale additional GPU servers before the queue grows too long.
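One way to act on the queue depth metric is a simple threshold rule evaluated on each monitoring tick. The sketch below is illustrative only: `desiredReplicas`, the target of 4 queued requests per server, and the replica limits are assumed values that would need tuning against real latency data.

```javascript
// Returns how many GPU servers we would like to have, given the total number
// of requests currently waiting across all servers.
function desiredReplicas(queueDepth, currentReplicas, opts = {}) {
    const { targetQueuePerReplica = 4, minReplicas = 1, maxReplicas = 8 } = opts;
    const needed = Math.ceil(queueDepth / targetQueuePerReplica);
    // Never shrink just because the queue is momentarily short;
    // scale-down is left to a separate, slower policy.
    const wanted = Math.max(needed, currentReplicas);
    return Math.min(Math.max(wanted, minReplicas), maxReplicas);
}

console.log(desiredReplicas(3, 2));   // 2: queue is within target, keep current size
console.log(desiredReplicas(20, 2));  // 5: 20 waiting / 4 per server
console.log(desiredReplicas(100, 2)); // 8: capped at maxReplicas
```

Because new GPU servers take time to load model weights, a rule like this is only useful if it triggers well before the queue reaches a painful length.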

Watch out for

Generation Length: It's not just the input prompt! As the model generates an answer token by token, the KV Cache grows by one entry per generated token. A request might fit perfectly when it starts, but OOM the server halfway through generating a 2,000-word essay. Before admitting a request, reserve memory for the prompt plus the maximum possible generation length (its max_tokens).
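This failure mode can be reproduced with a little arithmetic. The sketch below counts KV Cache usage in tokens and uses hypothetical names (`fitsAtAdmission`, `fitsWorstCase`, `firstOverflowStep`) and assumed sizes. It compares checking only the prompt against reserving the prompt plus the full generation budget.

```javascript
// Naive check: only the prompt has to fit in the free KV Cache.
function fitsAtAdmission(promptTokens, freeTokens) {
    return promptTokens <= freeTokens;
}

// Safe check: the prompt plus the full generation budget has to fit.
function fitsWorstCase(promptTokens, maxNewTokens, freeTokens) {
    return promptTokens + maxNewTokens <= freeTokens;
}

// Index of the generated token that overflows the cache (0 if the prompt
// alone does not fit), or null if the request can finish within free memory.
function firstOverflowStep(promptTokens, maxNewTokens, freeTokens) {
    const room = freeTokens - promptTokens;
    if (room < 0) return 0;
    return maxNewTokens > room ? room + 1 : null;
}

const freeKv = 10000;   // free KV Cache, in tokens (assumed)
const promptLen = 9000; // prompt length, in tokens (assumed)
const budget = 3000;    // generation budget for a long essay (assumed)

console.log(fitsAtAdmission(promptLen, freeKv));           // true: looks fine at the start
console.log(fitsWorstCase(promptLen, budget, freeKv));     // false: should be queued instead
console.log(firstOverflowStep(promptLen, budget, freeKv)); // 1001: overflows on the 1001st generated token
```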