Model serving cold start

The first request to a fresh replica pays for everything that hasn’t loaded yet.

The idea

When a new model replica spins up — a deploy, an autoscale event, or a scale-from-zero wake — it is not ready to answer quickly. The container has to start, the model weights have to load from storage into memory and onto the GPU, the inference graph has to compile or warm up, and every cache is empty.

Until all of that finishes, the replica is cold. Its first requests are slow, and if traffic arrives before it is warm, requests queue, latency spikes, and you breach your p95 SLO. The fix is to load and warm up before serving any real traffic.
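A rough sketch of why early traffic hurts: requests that arrive while the replica is still cold pile up, and the backlog only drains at whatever capacity is left over once the replica is warm. The arrival rate, the cold window, and the helper names below are illustrative assumptions. The capacity of 12.5 requests per second assumes one request at a time at roughly 80ms each.

```python
def backlog_after_cold_start(arrival_rps: float, cold_seconds: float) -> float:
    """Requests queued by the time the replica finally becomes warm."""
    return arrival_rps * cold_seconds


def drain_seconds(backlog: float, arrival_rps: float, capacity_rps: float) -> float:
    """Time to clear the backlog once warm; infinite if there is no spare capacity."""
    spare = capacity_rps - arrival_rps
    if spare <= 0:
        return float("inf")
    return backlog / spare


# Hypothetical numbers: 5 req/s arriving during a 30s cold window.
backlog = backlog_after_cold_start(arrival_rps=5.0, cold_seconds=30.0)
print(backlog)                                                   # 150.0
print(drain_seconds(backlog, arrival_rps=5.0, capacity_rps=12.5))  # 20.0
```

Under these assumptions the replica spends another 20 seconds catching up after it is warm, and every request in that backlog has already waited far longer than the SLO.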

Watch a replica warm up

[Interactive chart: request latency as a replica moves from cold to warm, plotted against the 300ms SLO line, with counts of served and queued requests.]

How it works

The lifecycle runs cold → loading → warming → warm. The trick is to make the load balancer wait for the last state. A readiness probe reports the replica as ready only after weights are loaded and a warmup forward pass on a dummy batch has run, so the inference graph is already compiled. Traffic is routed only once /ready returns 200.

import time

# Assumed to be supplied by the serving code (not defined here):
#   load_weights(), make_dummy_batch(), model, idle_for()

weights_loaded = False
warmed_up = False

def start():
    global weights_loaded, warmed_up
    load_weights()                 # storage -> CPU/GPU memory
    weights_loaded = True
    # warm up: compile graph & fill caches before real traffic
    dummy = make_dummy_batch()
    for _ in range(3):
        model.forward(dummy)       # first pass triggers CUDA/JIT compile
    warmed_up = True

def ready():                       # GET /ready
    if weights_loaded and warmed_up:
        return 200                 # router may now send traffic
    return 503                     # keep me out of rotation

# keep-warm: a periodic synthetic request stops the replica
# (and any scaled-down GPU) from going cold during idle gaps
def keep_warm_loop():
    while True:
        if idle_for() > 60:        # assumed: seconds since the last request
            model.forward(make_dummy_batch())
        time.sleep(30)             # check twice per idle threshold
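One detail the code above leaves open is where start() runs. If it blocks the thread that serves HTTP, /ready cannot answer at all while weights load. A minimal sketch, using only the Python standard library, runs the slow startup in a background thread so the probe can return 503 in the meantime. The names ReplicaState, ready_status, make_handler, start_in_background, and serve, and the port 8080, are hypothetical and stand in for whatever your serving framework provides.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ReplicaState:
    """Thread-safe 'warm' flag shared by the loader thread and the probe."""

    def __init__(self) -> None:
        self._warm = threading.Event()

    def mark_warm(self) -> None:
        self._warm.set()

    def is_warm(self) -> bool:
        return self._warm.is_set()


def ready_status(state: ReplicaState) -> int:
    """HTTP status for GET /ready: 200 only once load and warmup are done."""
    return 200 if state.is_warm() else 503


def make_handler(state: ReplicaState):
    class ReadyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/ready":
                self.send_error(404)
                return
            self.send_response(ready_status(state))
            self.end_headers()

        def log_message(self, *args):  # keep the example quiet
            pass

    return ReadyHandler


def start_in_background(state: ReplicaState, load_and_warm) -> threading.Thread:
    """Run the slow startup off the HTTP thread, then flip the flag."""
    def run():
        load_and_warm()   # stand-in for load_weights() plus warmup passes
        state.mark_warm()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def serve(state: ReplicaState, port: int = 8080) -> None:
    """Blocking: answer /ready until the process exits."""
    ThreadingHTTPServer(("127.0.0.1", port), make_handler(state)).serve_forever()
```

A typical wiring would call start_in_background(state, start) and then serve(state). The flag flips only after the warmup passes return, so the router never sees 200 from a replica that still has to compile its graph.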

Scale-to-zero saves money but means every idle period ends in another cold start. A small keep-warm pool — one or two spare warm replicas — absorbs the first burst while the autoscaler brings up more.
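To get a feel for how small the warm pool can be, one rough approach is Little's law: the number of requests in flight is about the arrival rate times the latency. The sketch below divides that by an assumed per-replica concurrency. The burst rate and concurrency are made-up inputs, and warm_replicas_needed is a hypothetical helper, not part of any autoscaler API.

```python
import math


def warm_replicas_needed(burst_rps: float, latency_s: float,
                         concurrency_per_replica: int) -> int:
    """Warm replicas required to hold the burst's in-flight requests."""
    in_flight = burst_rps * latency_s          # Little's law: L = lambda * W
    return max(1, math.ceil(in_flight / concurrency_per_replica))


# Hypothetical burst: 40 req/s at ~80ms steady-state latency,
# each replica handling 2 requests concurrently.
print(warm_replicas_needed(burst_rps=40.0, latency_s=0.08,
                           concurrency_per_replica=2))  # 2
```

This ignores queueing variance, so treat the result as a floor rather than a safe target.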

Cost

Stage              | Typical time | What it costs
Container start    | 10–60s       | Image pull from registry, runtime + CUDA init
Load weights       | 15–90s       | Read GBs from storage into host then GPU memory
Warm up            | 2–10s        | CUDA/graph compile, kernel autotune on first pass
First real request | +1–6s        | JIT compile if no warmup, plus cold caches
Steady state       | ~80ms        | Caches filled, graph cached, GPU resident
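Adding up the first three rows gives a rough range for how long a replica stays out of rotation. This small sketch assumes the stages run strictly one after another, which the table does not state. The STAGES name is illustrative.

```python
# (min, max) seconds per stage, taken from the table above.
STAGES = {
    "container start": (10, 60),
    "load weights": (15, 90),
    "warm up": (2, 10),
}

best = sum(low for low, _ in STAGES.values())
worst = sum(high for _, high in STAGES.values())
print(f"time to ready: {best}s to {worst}s")  # time to ready: 27s to 160s
```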

Worked example

A 7B-parameter replica takes about 40s to pull the image, 25s to load weights onto the GPU, and its first inference triggers a 6s CUDA graph compile. A traffic spike at 9am scales the deployment and routes to a brand-new replica.

Without protection, its first ~20 requests see 1.2s+ latency and breach the 300ms SLO until the graph is cached and the GPU caches fill — exactly the curve above. The fix is a readiness probe (no traffic until weights load and a dummy batch has run) plus a warm pool of 2 spare replicas that absorbs the spike while the new replica finishes warming. By the time it joins rotation, it is already under the line.
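The numbers in this example can be checked with a few lines of arithmetic. The sketch below sums the three startup costs, then compares p95 latency with and without warm-up. The window of 200 requests is an assumption made for illustration, as is the nearest-rank percentile helper. The 20 slow requests, 1.2s latency, 80ms steady state, and 300ms SLO come from the text.

```python
import math


def percentile(values, p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


SLO_S = 0.300
time_to_ready_s = 40 + 25 + 6          # image pull + weight load + graph compile
print(time_to_ready_s)                 # 71

# Assumed window of 200 requests; the first 20 hit the cold replica.
unprotected = [1.2] * 20 + [0.08] * 180
protected = [0.08] * 200               # readiness probe held traffic back

print(percentile(unprotected, 95) > SLO_S)  # True
print(percentile(protected, 95) > SLO_S)    # False
```

With 10% of the window served cold, the 95th percentile lands on a cold request, which is why a handful of slow responses is enough to breach the SLO.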

Check yourself

Your replicas scale to zero between bursts of traffic. Users complain that the first request after a quiet spell is slow, every time. What is the most direct fix?

Your readiness probe returns 200 as soon as the weights finish loading. The first real request is still slow. Why?