GPU Memory Fragmentation

Why LLM inference engines used to waste most of the GPU memory they reserved for the KV cache, and how PagedAttention fixed it.

The idea

When an LLM generates text, each new token attends to every token before it. To avoid recomputing that history at every step, the engine stores the attention keys and values of all previous tokens in GPU memory as the "KV cache". Because the engine doesn't know in advance whether a prompt will produce a short 10-token answer or a 1,000-token essay, older engines reserved one contiguous chunk of GPU memory sized for the maximum possible length, just to be safe. If the model generated only 10 tokens out of a 1,000-token reservation, 99% of that memory sat unused. This over-reservation is internal fragmentation; the gaps left between differently sized reservations add external fragmentation on top. Together they limited a GPU to serving only a few requests at a time.

Step 1: Old way. The engine pre-allocates a huge chunk of contiguous memory for every request, wasting massive amounts of space.
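
To put a number on the waste, here is a rough back-of-the-envelope sketch. The model dimensions (40 layers, hidden size 5120, 16-bit values) are illustrative assumptions for a 13B-parameter-class model, not figures from this article, and `kvBytesPerToken` and `wastedFraction` are hypothetical helper names.

```javascript
// Back-of-the-envelope KV cache sizing. All dimensions are assumed for illustration.
const NUM_LAYERS = 40;      // assumed transformer depth
const HIDDEN_SIZE = 5120;   // assumed hidden dimension
const BYTES_PER_VALUE = 2;  // 16-bit floating point

// Each token stores one key vector and one value vector per layer.
function kvBytesPerToken() {
    return 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE;
}

// Fraction of a max-length reservation that is never used.
function wastedFraction(tokensUsed, maxTokens) {
    return 1 - tokensUsed / maxTokens;
}

const perToken = kvBytesPerToken();  // 819200 bytes, i.e. 800 KiB per token
const reserved = perToken * 1000;    // bytes reserved for a 1,000-token maximum
const used = perToken * 10;          // bytes actually filled by a 10-token answer

console.log(`per token: ${perToken / 1024} KiB`);
console.log(`reserved: ${(reserved / 2 ** 20).toFixed(1)} MiB, used: ${(used / 2 ** 20).toFixed(1)} MiB`);
console.log(`wasted: ${(wastedFraction(10, 1000) * 100).toFixed(1)}%`);
```

Under these assumptions a single request reserves several hundred MiB and fills only a few, which is why the number of concurrent requests was so limited.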

How it works (PagedAttention / vLLM)

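Inspired by how operating systems manage RAM with virtual memory and paging, the vLLM authors introduced PagedAttention. Instead of one large contiguous reservation per request, the KV cache is split into small fixed-size "pages" (called blocks in vLLM, e.g., holding 16 tokens each). As the LLM generates tokens, the engine allocates one page at a time, only when the previous one fills up. The pages need not be adjacent in GPU memory, so external fragmentation disappears and internal fragmentation is limited to the unfilled slots of each sequence's last page. The freed memory lets the engine batch more requests at once; the vLLM authors report 2-4x higher throughput than earlier serving systems on the same hardware.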

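// 1. Old way (pre-allocation)
// Must reserve one contiguous run of MAX_TOKENS_PER_SEQ slots up front; fails if no such run is free.
const MAX_TOKENS_PER_SEQ = 2048;
const kvCacheBuffer = new Array(MAX_TOKENS_PER_SEQ).fill(null); // stand-in for a contiguous GPU allocation

// 2. PagedAttention (dynamic paging)
// We only allocate what we need, exactly when we need it.
const PAGE_SIZE = 16;             // tokens per page
const gpuPagePool = [7, 2, 9, 4]; // stand-in free list of physical page ids, in no particular order
const logicalBlocks = [];         // this sequence's pages: { physicalPage, tokens }

function appendToken(token) {
    let current = logicalBlocks[logicalBlocks.length - 1];
    if (!current || current.tokens.length === PAGE_SIZE) {
        // Grab any free page from the pool; it need not be adjacent to the previous one.
        if (gpuPagePool.length === 0) throw new Error("KV cache pool exhausted");
        current = { physicalPage: gpuPagePool.pop(), tokens: [] };
        logicalBlocks.push(current);
    }
    current.tokens.push(token);
}

for (let t = 0; t < 20; t++) appendToken(t); // 20 tokens occupy 2 pages instead of 2048 slots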
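
Paging does not remove waste completely: the last page of each sequence is usually only partly filled. A small sketch of that arithmetic, assuming the 16-token page size used above (`pagesNeeded` and `wastedSlots` are hypothetical helper names):

```javascript
const PAGE_SIZE = 16; // tokens per page, matching the example page size

// Number of pages a sequence of numTokens occupies.
function pagesNeeded(numTokens) {
    return Math.ceil(numTokens / PAGE_SIZE);
}

// Slots allocated but not filled; always less than PAGE_SIZE.
function wastedSlots(numTokens) {
    return pagesNeeded(numTokens) * PAGE_SIZE - numTokens;
}

for (const n of [10, 16, 17, 1000]) {
    console.log(`${n} tokens -> ${pagesNeeded(n)} pages, ${wastedSlots(n)} unused slots`);
}
// For comparison, a 2048-slot contiguous reservation leaves 2038 slots unused in the 10-token case.
```

The worst case is a sequence that has just spilled one token into a new page (17 tokens leave 15 slots unused), so the waste per sequence is bounded by one page regardless of the maximum sequence length.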

Cost

PagedAttention requires a "block table" that maps each sequence's logical blocks to the physical pages in GPU memory that hold them. During the attention computation, the GPU kernel has to look up this table to find where the scattered pages actually live, and its memory reads are less regular than with one contiguous buffer. This indirection adds some overhead to the attention kernel, but the large increase in batch size (and therefore throughput) more than makes up for it.
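
The lookup itself is simple index arithmetic. The following is a minimal sketch on the CPU, assuming a 16-token page size and a made-up block table; a real engine performs the equivalent translation inside its GPU attention kernel, and the names `locate` and `physicalSlot` are hypothetical.

```javascript
const PAGE_SIZE = 16; // tokens per page (assumed)

// Hypothetical block table for one sequence:
// logical block i is stored in physical page blockTable[i].
const blockTable = [7, 2, 9];

// Translate a logical token position into a physical page and an offset inside it.
function locate(tokenIndex) {
    const logicalBlock = Math.floor(tokenIndex / PAGE_SIZE);
    if (logicalBlock >= blockTable.length) {
        throw new RangeError(`token ${tokenIndex} has no page allocated`);
    }
    return {
        physicalPage: blockTable[logicalBlock],
        offset: tokenIndex % PAGE_SIZE,
    };
}

// Flat slot index, assuming the pool is laid out page after page.
function physicalSlot(tokenIndex) {
    const { physicalPage, offset } = locate(tokenIndex);
    return physicalPage * PAGE_SIZE + offset;
}

console.log(locate(0));        // { physicalPage: 7, offset: 0 }
console.log(locate(20));       // { physicalPage: 2, offset: 4 }
console.log(physicalSlot(20)); // 36
```

Tokens 15 and 16 are neighbors in the sequence but land in different physical pages (7 and 2 here), which is the extra indirection the kernel pays for.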

Watch out for

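Prefix Caching: Because the KV cache is paged, two requests whose prompts start with the exact same System Prompt ("You are a helpful assistant...") can have their block tables point at the same physical pages in GPU memory, saving even more VRAM. Sharing happens at page granularity, so only the identical leading tokens that fill whole pages are shared; anything after the first differing token gets its own pages. This is called prefix caching (some APIs call it prompt caching), and it's a direct benefit of a paged architecture.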
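
A minimal sketch of how such sharing could be tracked, assuming pages are keyed by the full token prefix they complete and counted with reference counts. The tiny page size, the `buildBlockTable` helper, and the JSON-string keys are all illustrative choices, not how any particular engine implements it.

```javascript
const PAGE_SIZE = 4; // deliberately tiny so the example stays readable

const pagesByPrefix = new Map(); // prefix key -> physical page id
const refCount = new Map();      // physical page id -> number of sequences sharing it
let nextPageId = 0;

// Build a block table for a prompt, reusing full pages whose entire prefix is already cached.
function buildBlockTable(tokens) {
    const table = [];
    const fullPages = Math.floor(tokens.length / PAGE_SIZE);
    for (let i = 0; i < fullPages; i++) {
        // The key covers every token from the start of the prompt, so a page is
        // shared only when the whole prefix matches, not just the page's own tokens.
        const key = JSON.stringify(tokens.slice(0, (i + 1) * PAGE_SIZE));
        if (!pagesByPrefix.has(key)) {
            pagesByPrefix.set(key, nextPageId++);
        }
        const page = pagesByPrefix.get(key);
        refCount.set(page, (refCount.get(page) || 0) + 1);
        table.push(page);
    }
    // A partly filled last page is private to the sequence.
    if (tokens.length % PAGE_SIZE !== 0) {
        table.push(nextPageId++);
    }
    return table;
}

const system = ["You", "are", "a", "helpful", "assistant", ".", "Answer", "briefly"];
const tableA = buildBlockTable([...system, "What", "is", "RAM", "?", "Thanks"]);
const tableB = buildBlockTable([...system, "Explain", "paging"]);

console.log(tableA); // [ 0, 1, 2, 3 ]
console.log(tableB); // [ 0, 1, 4 ]
```

Both block tables start with pages 0 and 1, which hold the shared system prompt once; only the user-specific tails get separate pages.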