The first request to a fresh replica pays for everything that hasn’t loaded yet.
When a new model replica spins up — a deploy, an autoscale event, or a scale-from-zero wake — it is not ready to answer quickly. The container has to start, the model weights have to load from storage into memory and onto the GPU, the inference graph has to compile or warm up, and every cache is empty.
Until all of that finishes, the replica is cold. Its first requests are slow, and if traffic arrives before it is warm, requests queue, latency spikes, and you breach your p95 SLO. The fix is to load and warm up before serving any real traffic.
The lifecycle runs cold → loading → warming → warm. The trick is to make the load balancer wait for the last state. A readiness probe reports the replica as ready only after weights are loaded and a warmup forward pass on a dummy batch has run, so the inference graph is already compiled. Traffic is routed only once /ready returns 200.
```python
from time import sleep

# load_weights, make_dummy_batch, model, and idle_for are placeholders
# for whatever the serving stack provides.
weights_loaded = False
warmed_up = False


def start():
    global weights_loaded, warmed_up
    load_weights()  # storage -> CPU/GPU memory
    weights_loaded = True
    # warm up: compile graph & fill caches before real traffic
    dummy = make_dummy_batch()
    for _ in range(3):
        model.forward(dummy)  # triggers CUDA/JIT compile
    warmed_up = True


def ready():  # GET /ready
    if weights_loaded and warmed_up:
        return 200  # router may now send traffic
    return 503  # keep me out of rotation


# keep-warm: a periodic synthetic request stops the replica
# (and any scaled-down GPU) from going cold during idle gaps
def keep_warm_loop():
    while True:
        if idle_for() > 60:
            model.forward(make_dummy_batch())
        sleep(30)
```
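
The pseudocode above leaves the lifecycle states implicit in two booleans. As a minimal sketch, assuming the weight loading and warmup pass are supplied as plain callables, the same gate can be written with an explicit cold → loading → warming → warm state and a small standard-library HTTP handler for `/ready`. The names `Replica`, `load_fn`, `warm_fn`, `make_handler`, and `serve` are hypothetical, chosen only for illustration.

```python
import threading
from enum import Enum
from http.server import BaseHTTPRequestHandler, HTTPServer


class State(Enum):
    COLD = "cold"
    LOADING = "loading"
    WARMING = "warming"
    WARM = "warm"


class Replica:
    """Tracks the lifecycle; load_fn and warm_fn are hypothetical stand-ins
    for weight loading and a dummy-batch forward pass."""

    def __init__(self, load_fn, warm_fn, warmup_passes=3):
        self.state = State.COLD
        self._load_fn = load_fn
        self._warm_fn = warm_fn
        self._warmup_passes = warmup_passes

    def start(self):
        self.state = State.LOADING
        self._load_fn()                 # storage -> host/GPU memory
        self.state = State.WARMING
        for _ in range(self._warmup_passes):
            self._warm_fn()             # dummy forward pass
        self.state = State.WARM

    def ready_status(self):
        # Ready only in the final state; everything earlier is 503.
        return 200 if self.state is State.WARM else 503


def make_handler(replica):
    class ReadyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status = replica.ready_status() if self.path == "/ready" else 404
            self.send_response(status)
            self.end_headers()

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return ReadyHandler


def serve(replica, host="127.0.0.1", port=8080):
    # Load and warm in the background so /ready can answer 503 meanwhile.
    threading.Thread(target=replica.start, daemon=True).start()
    HTTPServer((host, port), make_handler(replica)).serve_forever()
```

Calling `serve(Replica(load_fn=..., warm_fn=...))` would answer `GET /ready` with 503 until the warmup passes finish and with 200 afterwards, which is the signal the router waits for.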
Scale-to-zero saves money but means every idle period ends in another cold start. A small keep-warm pool — one or two spare warm replicas — absorbs the first burst while the autoscaler brings up more.
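
One way to picture the keep-warm pool is as a fixed number of spare replicas added on top of whatever the current load needs, so the target never reaches zero. The sketch below is an illustration under that assumption, not any particular autoscaler's rule; `desired_replicas`, `rps_per_replica`, and `spare_warm` are hypothetical names.

```python
import math


def desired_replicas(current_rps, rps_per_replica, spare_warm=2):
    """Replicas needed for the current load, plus a spare warm pool.

    spare_warm is the keep-warm pool size (one or two in the text above).
    """
    needed = math.ceil(current_rps / rps_per_replica)
    return needed + spare_warm


# During an idle gap the target stays at the spare pool, not zero.
idle_target = desired_replicas(current_rps=0, rps_per_replica=10)
# Under load, the spares sit on top of the replicas the traffic needs.
busy_target = desired_replicas(current_rps=25, rps_per_replica=10)
```

The trade-off is the one the paragraph names: the spares cost money while idle, but the first burst after a quiet spell lands on replicas that are already warm.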
| Stage | Typical time | What it costs |
|---|---|---|
| Container start | 10–60s | Image pull from registry, runtime + CUDA init |
| Load weights | 15–90s | Read GBs from storage into host then GPU memory |
| Warm up | 2–10s | CUDA/graph compile, kernel autotune on first pass |
| First real request | +1–6s | JIT compile if no warmup, plus cold caches |
| Steady state | ~80ms | Caches filled, graph cached, GPU resident |
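
Adding up the first three rows gives a rough range for how long a replica stays out of rotation when the readiness probe waits for warmup. The snippet below only sums the ranges from the table; the dictionary keys are illustrative labels.

```python
# (min_seconds, max_seconds) per stage, taken from the table above
stages = {
    "container_start": (10, 60),
    "load_weights": (15, 90),
    "warm_up": (2, 10),
}

best_case_s = sum(low for low, _ in stages.values())    # 10 + 15 + 2
worst_case_s = sum(high for _, high in stages.values())  # 60 + 90 + 10

print(f"time until ready: {best_case_s}-{worst_case_s} s")  # 27-160 s
```
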
A 7B-parameter replica takes about 40s to pull the image and 25s to load weights onto the GPU, and its first inference triggers a 6s CUDA graph compile. A traffic spike at 9am scales the deployment and routes to a brand-new replica.
Without protection, its first ~20 requests see 1.2s+ latency and breach the 300ms SLO until the graph is cached and the GPU caches fill — exactly the curve above. The fix is a readiness probe (no traffic until weights load and a dummy batch has run) plus a warm pool of 2 spare replicas that absorbs the spike while the new replica finishes warming. By the time it joins rotation, it is already under the line.
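
The numbers in this scenario can be added up to see how long the warm pool has to carry the spike. This is a rough sketch: the three durations come from the example above, while `spike_rps` is a hypothetical arrival rate picked only to make the arithmetic concrete.

```python
def seconds_until_ready(image_pull_s, weight_load_s, warmup_compile_s):
    """Time before a gated replica joins rotation: every stage runs first."""
    return image_pull_s + weight_load_s + warmup_compile_s


def requests_during_cold_start(spike_rps, cold_start_s):
    """Requests that arrive while the new replica is still not ready."""
    return spike_rps * cold_start_s


cold_start_s = seconds_until_ready(40, 25, 6)  # 71 s
# spike_rps=10 is an assumed rate, not a figure from the scenario.
absorbed_by_pool = requests_during_cold_start(spike_rps=10, cold_start_s=cold_start_s)
```

With the readiness gate in place, the 6s compile is paid during warmup instead of by a user, and the warm pool serves whatever arrives during those 71 seconds.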
Your replicas scale to zero between bursts of traffic. Users complain that the first request after a quiet spell is slow, every time. What is the most direct fix?
Your readiness probe returns 200 as soon as the weights finish loading. The first real request is still slow. Why?