Already-warm replicas keep serving from memory when the registry dies — it's the new pods that can't find out which model to load, so you can't add capacity exactly when you need it.
A model registry stores model artifacts and metadata: which model_version is currently "production", where each artifact lives, and what stage it's in. Serving replicas consult the registry to learn which model to load — on startup, when autoscale spins up a new pod, and when someone promotes a new version.
The hazard is subtle. When the registry goes down, healthy already-warm replicas keep serving fine, because they cached the resolved model in memory. But any new or restarting replica can't resolve which model to load, so it can't become ready. During a traffic spike plus an autoscale event you can't add capacity, and a rolling deploy stalls or crash-loops. This is the classic cold dependency in the hot startup path.
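The asymmetry between warm and fresh replicas can be reproduced in a few lines. The following is a minimal simulation, not production code: `FakeRegistry`, `RegistryDown`, and `Replica` are hypothetical stand-ins invented for illustration, and "loading" a model is reduced to building a string.

```python
class RegistryDown(Exception):
    """Raised by the hypothetical registry client when it cannot be reached."""


class FakeRegistry:
    """Stand-in for a model registry; `up` toggles a simulated outage."""

    def __init__(self):
        self.up = True

    def get_production(self, name):
        if not self.up:
            raise RegistryDown(f"cannot resolve {name}")
        return {"version": "v7", "uri": f"s3://models/{name}/v7"}


class Replica:
    """A serving replica whose startup hard-depends on the registry."""

    def __init__(self, registry):
        self.registry = registry
        self.model = None  # nothing is loaded until start() succeeds

    def start(self):
        # Hard dependency: if this call raises, the replica never becomes ready.
        pointer = self.registry.get_production("recommender")
        self.model = f"model@{pointer['version']}"  # pretend to load weights

    def ready(self):
        return self.model is not None

    def predict(self, x):
        # Serving only touches the in-memory model, never the registry.
        return f"{self.model}({x})"


registry = FakeRegistry()
warm = Replica(registry)
warm.start()            # registry is up, so the warm replica loads v7

registry.up = False     # the outage begins

print(warm.predict("user-1"))  # still served from memory

fresh = Replica(registry)
try:
    fresh.start()
except RegistryDown:
    pass                # in a real pod this shows up as a crash or a stuck probe
print(fresh.ready())    # the fresh replica never became ready
```

The warm replica never calls the registry on the request path, which is why the outage stays invisible until something needs to start.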
The fix: cache the last-known-good model pointer locally (on disk or a sidecar), pin a fallback artifact, and make the registry a soft dependency on startup — resolve from cache when the registry is unreachable, and serve registry reads from a read replica or a CDN-cached manifest.
Resolve the production model from the registry when you can, but persist the last-known-good pointer locally so a fresh pod can still answer "which model?" when the registry is unreachable. The readiness probe must only pass once a model is actually loaded — never before.
```python
import logging

# Assumed to exist elsewhere in the service: `registry` (the registry client),
# its `Timeout` and `Unavailable` exceptions, the web framework's `app`, and
# the helpers `persist_lkg`, `read_lkg`, and `load_artifact`.
log = logging.getLogger(__name__)

LKG_PATH = "/var/lib/serving/last_known_good.json"
PINNED_FALLBACK = "s3://models/recommender/v7"  # safe artifact, baked in


def resolve_production_model():
    # 1. Try the registry first: it has the freshest truth.
    try:
        pointer = registry.get_production("recommender", timeout=0.5)
        persist_lkg(pointer)  # cache it for the next cold start
        return pointer, "registry"
    except (Timeout, Unavailable):
        pass

    # 2. Registry is unreachable. Fall back to the last-known-good pointer
    #    we persisted locally: the registry is a SOFT startup dependency.
    if (lkg := read_lkg(LKG_PATH)):
        return lkg, "local_cache"

    # 3. Nothing cached (truly cold pod). Use the pinned fallback artifact.
    return {"version": "v7", "uri": PINNED_FALLBACK}, "pinned"


def startup():
    pointer, source = resolve_production_model()
    model = load_artifact(pointer["uri"])  # pull weights, warm the model
    app.state.model = model
    log.info("resolved %s via %s", pointer["version"], source)


def readiness():
    # Only report ready (200) once a model is actually loaded in memory.
    if getattr(app.state, "model", None) is not None:
        return "ok", 200
    return "loading", 503
```
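The `persist_lkg` and `read_lkg` helpers are not shown in the snippet above. One plausible shape for them, assuming the pointer is a small JSON-serializable dict with `version` and `uri` keys, is an atomic write-then-rename plus a tolerant read. This is a sketch under those assumptions; the optional `path` parameter is an addition so the helpers can be exercised outside `/var/lib`.

```python
import json
import os
import tempfile

LKG_PATH = "/var/lib/serving/last_known_good.json"  # path used earlier in this section


def persist_lkg(pointer, path=LKG_PATH):
    """Atomically write the pointer so a crash never leaves a half-written file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(pointer, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def read_lkg(path=LKG_PATH):
    """Return the cached pointer, or None if it is missing, corrupt, or incomplete."""
    try:
        with open(path) as f:
            pointer = json.load(f)
    except (OSError, ValueError):  # missing or unreadable file, or invalid JSON
        return None
    if not isinstance(pointer, dict):
        return None
    if not all(key in pointer for key in ("version", "uri")):
        return None
    return pointer
```

Returning `None` for every failure mode keeps the caller simple: an unusable cache is treated exactly like an absent one, and resolution moves on to the pinned fallback.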
| Strategy | Availability during outage | Staleness risk | Can scale during outage |
|---|---|---|---|
| Registry on every request | None — fails instantly | Always fresh | No |
| Cache on startup | Warm pods fine, new pods stuck | Fresh per cold start | No |
| Local last-known-good | New pods resolve from disk | As old as last good resolve | Yes |
| Pinned fallback artifact | Any pod can boot a model | Pinned version may lag prod | Yes |
The trade-off is freshness for survivability: a cached or pinned pointer might lag the true production version, but it lets a brand-new pod become ready without a live registry — which is exactly what a spike needs.
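One way to keep that lag visible is to timestamp the pointer whenever the registry confirms it and report its age when a pod boots from cache. The sketch below assumes a `resolved_at` field and a 24-hour threshold; both are illustrative choices, not part of the original design. The pod would still serve in every case, and the age only feeds logging or alerting.

```python
import time

MAX_LKG_AGE_SECONDS = 24 * 3600  # hypothetical alerting threshold


def stamp(pointer, now=None):
    """Return a copy of the pointer annotated with when the registry confirmed it."""
    return {**pointer, "resolved_at": time.time() if now is None else now}


def staleness(pointer, now=None):
    """Seconds since the registry last confirmed this pointer, or None if unknown."""
    resolved_at = pointer.get("resolved_at")
    if resolved_at is None:
        return None  # e.g. the pinned fallback, which was never confirmed live
    now = time.time() if now is None else now
    return max(0.0, now - resolved_at)


def is_stale(pointer, max_age=MAX_LKG_AGE_SECONDS, now=None):
    """True when the pointer's age is unknown or exceeds max_age."""
    age = staleness(pointer, now)
    return age is None or age > max_age
```

A caller could stamp the pointer just before persisting it, then log a warning at startup when `is_stale` is true for whatever pointer was resolved from cache.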
Traffic doubles after a feature launch. Three warm replicas are serving v7 happily from memory, but the registry is mid-outage. Autoscale fires and a fourth pod starts: its startup calls registry.get_production() and times out, no model is ever loaded, and the readiness probe correctly returns 503, so the pod sits not ready and never takes traffic. The spike gets no extra capacity, and a rolling deploy that is trying to replace pods stalls or begins to crash-loop.
On-call contains the incident: stop the rollout, freeze scale-down so warm pods aren't reaped, and confirm the warm replicas are still green.
Then the fix lands: the new pod reads the locally persisted last-known-good pointer (v7), loads the artifact, and finally reports ready. Capacity recovers without the registry, and once the registry is back the next resolve refreshes the cache. Root cause: startup hard-depended on a live registry call with no local cache.
The registry is down. Why do your existing replicas keep serving fine while a brand-new pod can't?
What makes the registry a soft dependency on the startup path?