Serving massive machine learning models in production without melting the servers.
Running inference on a large AI model (such as an LLM or another deep neural network) is computationally heavy. If 100 users request predictions in the same millisecond, processing them one at a time builds a backlog: the last request has to wait for the 99 ahead of it to finish.
To fix this, model servers use dynamic batching: they hold incoming requests for a few milliseconds and group them into a single batch. Because a GPU runs the items of a batch in parallel, a batch of 8 often takes little longer than a single request, as long as the batch fits in memory and the GPU has spare capacity. This trades a small amount of latency (the wait) for a large increase in total throughput.
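To make the trade-off concrete, here is a back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration, not a measurement: 20 ms for a single inference, 25 ms for a batch of 8, and a 10 ms collection window. The helper `drain_time_ms` is hypothetical and simply counts how many batches a burst needs.

```python
import math

def drain_time_ms(num_requests, batch_size, batch_latency_ms, wait_ms=0.0):
    """Rough time to clear a burst of simultaneous requests, one batch at a time.

    wait_ms is added once, as a worst case for the final partial batch
    that sits out the collection window before it is dispatched.
    """
    num_batches = math.ceil(num_requests / batch_size)
    return num_batches * batch_latency_ms + wait_ms

# Assumed latencies (illustrative only): 20 ms per single request,
# 25 ms per batch of 8, 10 ms collection window.
sequential = drain_time_ms(100, batch_size=1, batch_latency_ms=20.0)
batched = drain_time_ms(100, batch_size=8, batch_latency_ms=25.0, wait_ms=10.0)

print(sequential)  # 2000.0 ms: 100 requests, one after another
print(batched)     # 335.0 ms: 13 batches of up to 8, plus one 10 ms wait
```

Under these assumed numbers the burst clears roughly six times faster with batching, while an individual request pays at most the 10 ms window in extra latency. Real ratios depend on the model, the hardware, and the batch size.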
def model_server_loop():
    # wait_for_requests and gpu_model stand in for the server's request queue and loaded model.
    while True:
        # Wait up to 10 ms to collect up to 8 requests
        batch = wait_for_requests(max_batch_size=8, timeout_ms=10)
        if batch:
            # One batched forward pass: matrix multiplication parallelizes well,
            # so 8 items often cost little more than 1.
            predictions = gpu_model.predict(batch)
            # Send each result back to the request that asked for it
            for req, pred in zip(batch, predictions):
                req.reply(pred)
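The loop above leaves `wait_for_requests` undefined. One possible way to write it, as a minimal sketch, uses Python's standard `queue.Queue`. The extra `request_queue` parameter is an assumption added here so the function is self-contained; the loop above does not pass one, and a real model server may collect requests quite differently.

```python
import queue
import time

def wait_for_requests(request_queue, max_batch_size=8, timeout_ms=10):
    """Collect up to max_batch_size items, waiting at most timeout_ms in total.

    request_queue is a hypothetical queue.Queue that a front end fills
    with incoming requests.
    """
    deadline = time.monotonic() + timeout_ms / 1000
    batch = []
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window closed: dispatch whatever has arrived
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch

# Usage: 11 queued requests come out as one full batch and one partial batch.
q = queue.Queue()
for i in range(11):
    q.put(i)

first = wait_for_requests(q)   # fills to 8 and returns without waiting
second = wait_for_requests(q)  # returns the remaining 3 once the window closes
print(len(first), len(second))  # 8 3
```

A full batch returns immediately, so the timeout only costs latency when traffic is light. That is the case where the window should be tuned: a longer window yields bigger batches, but every request in a partial batch waits for it.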