ML system design

Serving massive machine learning models in production without melting the servers.

The idea

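Running inference on a large AI model (such as an LLM or another deep neural network) is computationally heavy. If 100 users ask for predictions at the same moment and the server handles them one at a time, the last user has to wait for the 99 requests ahead of them, and the backlog keeps growing whenever requests arrive faster than they are processed.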

To fix this, model servers use dynamic batching. They hold incoming requests for a few milliseconds and group them into a single batch. A GPU can often process a batch of 8 requests in almost the same time as 1 request, because the whole batch runs as one set of large matrix operations that keeps the hardware busy. This trades a tiny bit of latency (the wait) for a large increase in total throughput.
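To make the trade-off concrete, here is a back-of-the-envelope calculation. All three timings are made-up assumptions chosen for illustration (20 ms for a single request, 25 ms for a batch of 8, a 10 ms collection window), not measurements of any real model, and the throughput figure assumes traffic is heavy enough that batches run back to back.

```python
# Hypothetical timings, chosen only to illustrate the trade-off.
SINGLE_REQUEST_MS = 20.0   # assumed GPU time for a batch of 1
BATCH_OF_8_MS = 25.0       # assumed GPU time for a batch of 8
BATCH_WAIT_MS = 10.0       # assumed maximum time spent collecting a batch


def throughput_per_second(batch_size: int, batch_time_ms: float) -> float:
    """Requests completed per second when batches run back to back."""
    return batch_size * 1000.0 / batch_time_ms


def worst_case_latency_ms(batch_time_ms: float, wait_ms: float) -> float:
    """Latency for a request that sits through the full collection window."""
    return wait_ms + batch_time_ms


unbatched = throughput_per_second(1, SINGLE_REQUEST_MS)
batched = throughput_per_second(8, BATCH_OF_8_MS)
batched_latency = worst_case_latency_ms(BATCH_OF_8_MS, BATCH_WAIT_MS)

print(f"unbatched: {unbatched:.0f} req/s, latency {SINGLE_REQUEST_MS:.0f} ms")
print(f"batched:   {batched:.0f} req/s, worst-case latency {batched_latency:.0f} ms")
```

Under these assumed numbers, throughput rises from 50 to 320 requests per second, while the worst-case latency of a single request grows from 20 ms to 35 ms. Real ratios depend on the model, the hardware, and the batch size.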

Incoming Traffic Load

Low traffic: Requests seldom overlap, so each one is processed on its own (a batch of 1) with little or no added wait.

How it works (Dynamic Batching)

def model_server_loop():
    # wait_for_requests and gpu_model are placeholders for the server's
    # request queue and the loaded model.
    while True:
        # Collect up to 8 requests, waiting at most 10 ms for them to arrive
        batch = wait_for_requests(max_batch_size=8, timeout_ms=10)

        if batch:
            # One batched forward pass: the GPU handles 8 items in nearly
            # the time of 1, because matrix multiplication is more
            # efficient on large batches.
            predictions = gpu_model.predict(batch)

            # Send each result back to the user who made the request
            for req, pred in zip(batch, predictions):
                req.reply(pred)
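The loop above leaves wait_for_requests undefined. Here is a minimal sketch of one way it could be written, assuming requests arrive on a standard-library queue.Queue. The pending parameter is an assumption added for this sketch and is not part of the original signature. Production model servers typically ship their own batching schedulers, so treat this as an illustration of the idea only.

```python
import queue
import time


def wait_for_requests(pending: queue.Queue, max_batch_size: int = 8,
                      timeout_ms: float = 10.0) -> list:
    """Collect up to max_batch_size items, waiting at most timeout_ms in total."""
    batch = []
    deadline = time.monotonic() + timeout_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # the collection window is over
        try:
            # Block until a request arrives or the window closes
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived in time
    return batch


# Usage: 11 queued requests become one full batch and one partial batch.
pending = queue.Queue()
for i in range(11):
    pending.put(f"request-{i}")

first = wait_for_requests(pending)
second = wait_for_requests(pending)
third = wait_for_requests(pending)
print(len(first), len(second), len(third))  # prints "8 3 0"
```

The first call returns as soon as 8 requests are collected, without waiting for the timeout. The second call takes the remaining 3 and gives up once the 10 ms window closes, and the third finds nothing and returns an empty list, which is why the server loop checks the batch before calling the model.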