Inference Queue Backlog

Why buffering too many LLM requests makes things worse for everyone.

The idea

GPU inference is slow. If 100 users hit your LLM at once, the GPU can only process a few requests at a time, and the rest wait in an internal queue. But what if the queue gets too long? If a request sits in the queue for 15 seconds, the browser's HTTP request may time out, or the user gives up and refreshes the page. The user is gone, but the request is still in your queue, and the GPU will eventually spend compute cycles answering a question nobody is waiting for. A refresh also typically re-sends the request, so a second copy joins the back of the same queue. This is Queue Backlog Poisoning.
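To put rough numbers on this, here is a minimal sketch of the arithmetic. The batch size of 4 is a made-up figure, the 2-second generation time is the normal-load figure used in this article, and the function names are hypothetical. It assumes the GPU runs at a steady throughput, which real servers only approximate.

```javascript
// Illustrative numbers only; measure your own server before relying on them.
const BATCH_SIZE = 4;              // requests the GPU serves concurrently (assumed)
const SECONDS_PER_REQUEST = 2;     // generation time under normal load
const CLIENT_TIMEOUT_SECONDS = 15; // how long a client waits before giving up

// Steady-state throughput in requests per second.
function throughput(batchSize, secondsPerRequest) {
    return batchSize / secondsPerRequest;
}

// Approximate wait for a request that has `requestsAhead` requests queued before it.
function estimateWaitSeconds(requestsAhead, batchSize, secondsPerRequest) {
    return requestsAhead / throughput(batchSize, secondsPerRequest);
}

// Largest queue depth at which the last request still starts before the client times out.
function maxUsefulQueueDepth(timeoutSeconds, batchSize, secondsPerRequest) {
    return Math.floor(timeoutSeconds * throughput(batchSize, secondsPerRequest));
}

// 100 users arrive at once: 4 are being served, 96 are ahead of the last one.
console.log(estimateWaitSeconds(96, BATCH_SIZE, SECONDS_PER_REQUEST)); // 48
console.log(maxUsefulQueueDepth(CLIENT_TIMEOUT_SECONDS, BATCH_SIZE, SECONDS_PER_REQUEST)); // 30
```

Under these assumptions the last of the 100 users waits about 48 seconds, far past a 15-second timeout, and anything queued beyond roughly 30 requests is unlikely to be answered in time. A real limit should be lower still, because this estimate leaves no room for the generation time itself.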

Step 1: Normal load. The queue is short, and users get answers in 2 seconds.

How it works (Load Shedding)

To survive traffic spikes, an inference server must aggressively protect its queue. If the queue gets too long, you must institute Load Shedding: immediately rejecting new incoming requests with an HTTP 503 (Service Unavailable) rather than letting them pile up. Furthermore, before handing a queued request to the GPU, the server should check whether the client has already disconnected (e.g., by checking whether the TCP socket is still open) and drop the request if so.

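// Assumed to exist elsewhere: `queue` (an array), MAX_QUEUE_DEPTH, and a `gpu` client.

// 1. Aggressive Load Shedding
function enqueueRequest(req, res) {
    if (queue.length >= MAX_QUEUE_DEPTH) {
        // DO NOT add to queue. Fail immediately.
        return res.status(503).send("Too many requests, try again.");
    }
    // Keep `res` so the answer can be sent later. Reading the prompt from
    // req.body.prompt is an assumption about the request shape.
    queue.push({ req, res, prompt: req.body.prompt, timestamp: Date.now() });
}

// 2. Drop disconnected clients
function processNextItem() {
    // Loop instead of recursing, so a long run of dead clients cannot overflow the stack.
    while (queue.length > 0) {
        const item = queue.shift();

        // Check if the user's browser already closed the connection
        if (item.req.socket.destroyed) {
            console.log("Client disconnected, skipping generation.");
            continue; // Skip this one!
        }

        return gpu.generate(item.prompt);
    }
    // Queue is empty: nothing to do.
}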
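The queue entries above record a timestamp, which allows a second check alongside the socket test: dropping requests that have simply waited too long. The sketch below is one possible way to do that, not part of the original code. The names `MAX_QUEUE_AGE_MS`, `isStale`, and `takeNextFresh` are hypothetical, and the 15-second limit is borrowed from the timeout example in this article.

```javascript
// Assumed limit: roughly how long a client waits before giving up.
const MAX_QUEUE_AGE_MS = 15000;

// True if the item has been queued longer than the client is likely to wait.
function isStale(item, now = Date.now()) {
    return now - item.timestamp > MAX_QUEUE_AGE_MS;
}

// Removes items from the front of the queue until it finds one that is still
// fresh. Returns that item, or null if every queued item was stale.
function takeNextFresh(queue, now = Date.now()) {
    while (queue.length > 0) {
        const item = queue.shift();
        if (!isStale(item, now)) {
            return item;
        }
        // Stale: the client has probably gone, so drop it without using the GPU.
    }
    return null;
}

// Usage with fake timestamps so the result does not depend on the clock.
const demoQueue = [
    { prompt: "old", timestamp: 0 },
    { prompt: "fresh", timestamp: 20000 },
];
console.log(takeNextFresh(demoQueue, 21000).prompt); // "fresh"
```

An age check is useful because a socket can look open while the user has already stopped waiting, for example when a proxy in front of the server holds the connection.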

Cost

Returning a 503 error is a poor user experience, but it is far better than a cascading failure. If you don't shed load, the queue grows without bound, latency climbs to minutes, every user hits a timeout, and nearly all of your expensive GPU compute is wasted generating text into the void. It is better to serve 80% of users well and reject 20% than to fail 100% of them.
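The trade-off can be illustrated with a toy simulation. This is a deliberately simplified model, not a benchmark: time advances in one-second ticks, and the arrival rate (5 per second), service rate (4 per second), queue limit (40), and 15-second timeout are all assumed values. The naive server here also skips the disconnect check, so it spends GPU work on requests whose users have already left.

```javascript
// Simulates a FIFO queue and counts how much GPU work reaches a waiting user.
function simulate({ arrivalsPerSec, servedPerSec, timeoutSec, durationSec, maxQueueDepth }) {
    const queue = []; // arrival time (in seconds) of each waiting request
    let useful = 0;   // answered before the client timed out
    let wasted = 0;   // answered after the client had already left
    let rejected = 0; // shed immediately with a 503

    for (let t = 0; t < durationSec; t++) {
        for (let i = 0; i < arrivalsPerSec; i++) {
            if (queue.length >= maxQueueDepth) {
                rejected++;
            } else {
                queue.push(t);
            }
        }
        for (let i = 0; i < servedPerSec && queue.length > 0; i++) {
            const arrivedAt = queue.shift();
            if (t - arrivedAt > timeoutSec) {
                wasted++;
            } else {
                useful++;
            }
        }
    }
    return { useful, wasted, rejected };
}

const load = { arrivalsPerSec: 5, servedPerSec: 4, timeoutSec: 15, durationSec: 300 };
const naive = simulate({ ...load, maxQueueDepth: Infinity });
const shedding = simulate({ ...load, maxQueueDepth: 40 });

console.log("no shedding:", naive);
console.log("with shedding:", shedding);
```

In this model the shedding server answers every request it accepts within the timeout and rejects the overflow, while the naive server rejects nothing but spends most of its work on requests that waited longer than 15 seconds. The longer the overload lasts, the worse the naive server's ratio becomes.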

Watch out for