ML Inference Serving (Dynamic Batching)

How to get the most out of expensive GPUs by making impatient HTTP requests carpool.

The idea

GPUs that serve ML models (such as Large Language Models) are expensive to run. If you process incoming HTTP requests one by one, each request uses only a small fraction of the GPU's parallel capacity, so most of the hardware sits idle. GPUs are designed to do thousands of math operations in parallel. To get your money's worth, use Dynamic Batching. Instead of sending each request to the GPU immediately, a queue holds it for a few milliseconds. It gathers a "batch" of requests (like passengers on a bus) and sends them to the GPU all at once. The GPU computes them together in roughly the same time it takes to compute one, as long as the batch fits within the GPU's compute and memory capacity. Beyond that point, the time per batch starts to grow with batch size.

Step 1: Unbatched Inference. 4 requests arrive. They are processed by the GPU one by one.

How it works (The Batching Queue)

A proxy layer sits in front of the model (often provided by a serving framework such as NVIDIA Triton Inference Server or Ray Serve). When an HTTP request comes in, the proxy holds the connection open and puts the input tensor into a holding queue. It waits until either the batch is full (e.g., 8 requests) or a maximum latency timeout expires (e.g., 20ms), whichever comes first. Then it stacks the queued inputs into a single batched tensor and sends it to the GPU.
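To make the "full or timed out, whichever comes first" rule concrete, here is a minimal sketch that replays a list of arrival times and reports when each batch would be sent. The function name plan_batches and the arrival times are made up for illustration, and the sketch assumes the GPU is always free when a batch is ready.

```python
def plan_batches(arrivals_ms, max_batch_size=8, timeout_ms=20):
    """Return a list of (flush_time_ms, request_indices), one entry per batch.

    arrivals_ms must be sorted in ascending order. GPU time is ignored:
    the GPU is assumed to be free whenever a batch is flushed.
    """
    batches = []
    current = []      # indices of the requests waiting in the open batch
    deadline = None   # time at which the open batch must be flushed
    for i, t in enumerate(arrivals_ms):
        # Timeout trigger: the open batch expired before this request arrived.
        if current and t > deadline:
            batches.append((deadline, current))
            current = []
        if not current:
            # The timeout clock starts with the first request of a batch.
            deadline = t + timeout_ms
        current.append(i)
        # Size trigger: the batch is full, flush immediately.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        # Whatever is left over is flushed when its timeout expires.
        batches.append((deadline, current))
    return batches


# Three requests trickle in, then four arrive in a burst (batch size 4).
print(plan_batches([0, 5, 12, 40, 41, 42, 43], max_batch_size=4, timeout_ms=20))
# → [(20, [0, 1, 2]), (43, [3, 4, 5, 6])]
```

The first batch leaves at 20ms because the timeout fired with only three requests waiting. The second leaves at 43ms, as soon as the fourth request of the burst fills it.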

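# Conceptual dynamic batcher (asyncio sketch; gpu_model is a placeholder object)

import asyncio

class Batcher:
    def __init__(self, gpu_model, max_batch_size=8, timeout_ms=20):
        self.gpu_model = gpu_model            # anything with predict(list) -> list
        self.max_batch_size = max_batch_size
        self.timeout_s = timeout_ms / 1000
        self.queue = asyncio.Queue()          # holds (input_data, future) pairs

    async def predict(self, input_data):
        # 1. Put the user request in the queue and keep a Future for its result
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((input_data, future))

        # 2. Wait for the background worker to process the batch
        return await future

    async def worker(self):
        # Background task: start it once with asyncio.create_task(batcher.worker())
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives; the timeout clock starts here
            batch = [await self.queue.get()]
            deadline = loop.time() + self.timeout_s
            # Keep collecting until the batch is full or the deadline passes
            while len(batch) < self.max_batch_size:
                try:
                    remaining = deadline - loop.time()
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            inputs = [item for item, _ in batch]
            # One large parallel forward pass, run in a thread so the event loop stays responsive
            results = await asyncio.to_thread(self.gpu_model.predict, inputs)
            for (_, future), result in zip(batch, results):
                if not future.done():         # skip requests whose caller gave up
                    future.set_result(result)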

Cost

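Dynamic Batching trades per-request latency for throughput. The first request in the queue can be delayed by up to the timeout (20ms in the example above) while waiting for friends to join the carpool, and under light traffic most batches leave only when that timeout expires. In return, the overall throughput (requests per second) of the server can rise several-fold. Gains of 5x to 10x are often quoted, but the real number depends on the model, the batch size, and the hardware. Higher throughput per GPU reduces the number of expensive GPU servers you need to rent.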
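The trade-off can be put into numbers with a toy cost model. The sketch below assumes GPU time per batch is a fixed overhead plus a small per-request cost. The function batch_stats and the values 10ms and 0.5ms are invented for illustration, not measurements from any real model.

```python
def batch_stats(batch_size, fixed_ms=10.0, per_item_ms=0.5, timeout_ms=20.0):
    """Toy cost model: GPU time = fixed overhead + a small cost per request.

    Returns (gpu_ms, throughput_rps, worst_case_latency_ms).
    All default numbers are assumptions chosen for illustration.
    """
    gpu_ms = fixed_ms + per_item_ms * batch_size
    # Requests completed per second if the GPU runs such batches back to back.
    throughput_rps = batch_size / gpu_ms * 1000
    # Worst case: the first request waits out the full timeout, then the batch runs.
    worst_case_latency_ms = timeout_ms + gpu_ms
    return gpu_ms, throughput_rps, worst_case_latency_ms


# Unbatched serving is modeled as batch size 1 with no waiting.
print("unbatched:", batch_stats(1, timeout_ms=0.0))
print("batch of 8:", batch_stats(8))
```

Under these assumed numbers, a batch of 8 takes 14ms instead of 10.5ms for a single request, so throughput rises about 6x while the worst-case latency grows from 10.5ms to 34ms.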

Watch out for

Variable-length inputs: In NLP (e.g., text generation), batching is hard because users send inputs of different lengths. You have to "pad" the shorter inputs with padding tokens (often zeros) so that the batch forms a perfect rectangle (matrix) for the GPU. If one user sends a 1,000-word prompt and 7 users send 5-word prompts, every short prompt is padded out to 1,000 positions, and the GPU spends most of its memory and compute on padding instead of real input.
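The arithmetic for that example is easy to check. This sketch counts one position per word for simplicity (real models count tokens, not words), and padding_waste is a name made up for illustration.

```python
def padding_waste(lengths):
    """Fraction of a padded batch that is padding rather than real input.

    Every sequence is padded to the length of the longest one in the batch.
    """
    padded_slots = max(lengths) * len(lengths)   # size of the rectangle
    real_slots = sum(lengths)                    # positions holding real input
    return 1 - real_slots / padded_slots


mixed = [1000] + [5] * 7    # one long prompt and seven short ones
print(f"mixed batch: {padding_waste(mixed):.1%} padding")

short_only = [5] * 7        # the same short prompts batched on their own
print(f"short-only batch: {padding_waste(short_only):.1%} padding")
```

The mixed batch has 8,000 positions but only 1,035 hold real input, so about 87% of it is padding. The seven short prompts batched by themselves need no padding at all, which is why grouping requests of similar length helps.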