Model serving registry outage

Already-warm replicas keep serving from memory when the registry dies — it's the new pods that can't find out which model to load, so you can't add capacity exactly when you need it.

The idea

A model registry stores model artifacts and metadata: which model_version is currently "production", where each artifact lives, and what stage it's in. Serving replicas consult the registry to learn which model to load — on startup, when autoscale spins up a new pod, and when someone promotes a new version.

The hazard is subtle. When the registry goes down, healthy already-warm replicas keep serving fine, because they cached the resolved model in memory. But any new or restarting replica can't resolve which model to load, so it can't become ready. During a traffic spike plus an autoscale event you can't add capacity, and a rolling deploy stalls or crash-loops. This is the classic cold dependency in the hot startup path.

The fix: cache the last-known-good model pointer locally (on disk or a sidecar), pin a fallback artifact, and make the registry a soft dependency on startup — resolve from cache when the registry is unreachable, and serve registry reads from a read replica or a CDN-cached manifest.

How it works

Resolve the production model from the registry when you can, but persist the last-known-good pointer locally so a fresh pod can still answer "which model?" when the registry is unreachable. The readiness probe must only pass once a model is actually loaded — never before.

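import logging

log = logging.getLogger("serving")

# Assumed to exist elsewhere: `registry` (the registry client), `Timeout` and
# `Unavailable` (its error types), `persist_lkg`, `read_lkg`, `load_artifact`,
# and `app` (the web application object).
LKG_PATH = "/var/lib/serving/last_known_good.json"
PINNED_FALLBACK = "s3://models/recommender/v7"  # safe artifact, baked in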

def resolve_production_model():
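    # 1. Try the registry first; it has the freshest truth.
    try:
        pointer = registry.get_production("recommender", timeout=0.5)
    except (Timeout, Unavailable):
        pointer = None
    if pointer is not None:
        try:
            persist_lkg(pointer)      # cache it for the next cold start
        except OSError as exc:
            # Best effort: a failed cache write must not block startup.
            # (Assumes persist_lkg surfaces disk errors as OSError.)
            log.warning("could not persist last-known-good pointer: %s", exc)
        return pointer, "registry"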

    # 2. Registry is unreachable. Fall back to the last-known-good pointer
    #    we persisted locally — the registry is a SOFT startup dependency.
    if (lkg := read_lkg(LKG_PATH)):
        return lkg, "local_cache"

    # 3. Nothing cached (truly cold pod). Use the pinned fallback artifact.
    return {"version": "v7", "uri": PINNED_FALLBACK}, "pinned"

def startup():
    pointer, source = resolve_production_model()
    model = load_artifact(pointer["uri"])   # pull weights, warm the model
    app.state.model = model
    log.info("resolved %s via %s", pointer["version"], source)

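def readiness():
    # Only report ready when a model is actually loaded in memory.
    # Returns (body, HTTP status); map this onto however your framework
    # sets status codes so "loading" is never served with a 200.
    if getattr(app.state, "model", None) is None:
        return "loading", 503
    return "ok", 200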
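
The code above calls persist_lkg and read_lkg without defining them. A minimal sketch of what they might look like follows, assuming the pointer is a small JSON-serializable dict with "version" and "uri" keys. Unlike the call above, this version of persist_lkg takes the path explicitly; that signature, the temp-file approach, and the key check are illustrative assumptions, not the original implementation.

```python
import json
import os
import tempfile
from typing import Optional


def persist_lkg(pointer: dict, path: str) -> None:
    """Write the pointer atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(pointer, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_lkg(path: str) -> Optional[dict]:
    """Return the cached pointer, or None if it is missing or unusable."""
    try:
        with open(path) as f:
            pointer = json.load(f)
    except (OSError, ValueError):
        # Missing, unreadable, or corrupt file: behave as if nothing was cached.
        return None
    # A pointer without the expected keys is treated as absent.
    if not isinstance(pointer, dict) or not {"version", "uri"} <= pointer.keys():
        return None
    return pointer
```

Returning None for a corrupt file lets resolve_production_model fall through to the pinned fallback instead of crashing a cold pod on a bad cache.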

Signals and trade-offs

Strategy	Availability during outage	Staleness risk	Can scale during outage
Registry on every request	None — fails instantly	Always fresh	No
Cache on startup	Warm pods fine, new pods stuck	Fresh per cold start	No
Local last-known-good	New pods resolve from disk	As old as last good resolve	Yes
Pinned fallback artifact	Any pod can boot a model	Pinned version may lag prod	Yes

The trade-off is freshness for survivability: a cached or pinned pointer might lag the true production version, but it lets a brand-new pod become ready without a live registry — which is exactly what a spike needs.
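
To make the staleness side of that trade-off visible, one option is to stamp each pointer with the time it was resolved and report its age. The sketch below assumes a resolved_at field and a 24-hour threshold; both are illustrative and not part of the original code. A stale pointer is flagged for alerting, not rejected, because refusing to boot would give back the survivability the cache was meant to buy.

```python
import time
from typing import Optional

MAX_LKG_AGE_SECONDS = 24 * 60 * 60  # hypothetical alerting threshold


def stamp(pointer: dict, now: Optional[float] = None) -> dict:
    """Return a copy of the pointer tagged with the time it was resolved."""
    return {**pointer, "resolved_at": time.time() if now is None else now}


def lkg_age_seconds(pointer: dict, now: Optional[float] = None) -> Optional[float]:
    """Age of the cached pointer in seconds, or None if it was never stamped."""
    resolved_at = pointer.get("resolved_at")
    if resolved_at is None:
        return None
    current = time.time() if now is None else now
    return max(0.0, current - resolved_at)


def is_stale(pointer: dict, now: Optional[float] = None,
             max_age: float = MAX_LKG_AGE_SECONDS) -> bool:
    """True when the pointer is older than max_age or its age is unknown."""
    age = lkg_age_seconds(pointer, now)
    return age is None or age > max_age
```

A pod that boots from a stale pointer could still serve, while emitting the age as a metric so on-call can see how far behind production it might be.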

Watch out for

Registry on the hot startup path with no cache. If a fresh pod must reach the registry to learn which model to load, the registry's uptime caps your ability to ever add a replica. Persist a last-known-good pointer locally so startup can proceed offline.
Autoscale plus outage equals no new capacity. The moment you most need to scale — a traffic spike — is when new pods are calling the registry. If they can't resolve, the spike has no relief and latency climbs across the warm pods.
A readiness probe that passes before the model loads. If readiness returns ok while the pod is still resolving, the gateway sends real traffic to a pod with no model. Gate readiness on an actually-loaded model, not on "process started."
No last-known-good persisted. A pointer cached only in memory dies with the pod. Write it to disk or a sidecar so the next cold start can read it.
Promotions also need the registry. A "promote new version" event reads and writes the registry, so promotions and rollouts stall during the outage. Freeze rollouts and scale-down until the registry recovers.
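
The freeze described in the last item can be sketched as a small guard that tracks registry health and reports which operations are safe. The class name, the three-failure threshold, and the action names below are hypothetical. The sketch also assumes the local last-known-good cache is already in place, which is why scale-up stays allowed during the outage.

```python
from dataclasses import dataclass


@dataclass
class RegistryHealth:
    """Tracks consecutive registry probe failures (threshold is illustrative)."""
    failure_threshold: int = 3
    consecutive_failures: int = 0

    def record(self, ok: bool) -> None:
        # Any success resets the streak; each failure extends it.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


def allowed_actions(health: RegistryHealth) -> dict:
    """Which operations are safe given the current registry health."""
    if health.healthy:
        return {"promote": True, "rolling_deploy": True,
                "scale_down": True, "scale_up": True}
    # Registry looks down: freeze anything that removes warm pods or needs
    # registry reads and writes. Scale-up stays on because new pods can
    # resolve from the local cache.
    return {"promote": False, "rolling_deploy": False,
            "scale_down": False, "scale_up": True}
```

Requiring several consecutive failures before freezing avoids halting rollouts on a single slow registry call.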

Worked example

Traffic doubles after a feature launch. Three warm replicas are serving v7 happily from memory, but the registry is mid-outage. Autoscale fires and a fourth pod starts: its startup calls registry.get_production(), times out, and the readiness probe correctly returns 503, so the pod sits not ready and never takes traffic. The spike gets no extra capacity, and a rolling deploy that is trying to replace pods begins to crash-loop. On-call contains the incident: stop the rollout, freeze scale-down so warm pods aren't reaped, and confirm the warm replicas are still green. Then the fix lands: the new pod reads the locally persisted last-known-good pointer (v7), loads the artifact, and finally reports ready. Capacity recovers without the registry, and once the registry is back the next resolve refreshes the cache. Root cause: startup hard-depended on a live registry call with no local cache.
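
The sequence in this example can be replayed with an in-memory stand-in for the registry. Everything in the sketch is hypothetical scaffolding: FakeRegistry, RegistryDown, the dict used as a last-known-good store, and the v8 version promoted after recovery do not come from the original code. It only contrasts a hard startup dependency with a soft one.

```python
class RegistryDown(Exception):
    """Raised by the fake registry while it is 'mid-outage'."""


class FakeRegistry:
    def __init__(self, up: bool, production: dict):
        self.up = up
        self.production = production

    def get_production(self, name: str) -> dict:
        if not self.up:
            raise RegistryDown(name)
        return dict(self.production)


def hard_startup(registry: FakeRegistry) -> dict:
    """Before the fix: no registry, no model."""
    return registry.get_production("recommender")


def soft_startup(registry: FakeRegistry, lkg_store: dict):
    """After the fix: fall back to the last-known-good pointer."""
    try:
        pointer = registry.get_production("recommender")
    except RegistryDown:
        if "pointer" in lkg_store:
            return lkg_store["pointer"], "local_cache"
        raise
    lkg_store["pointer"] = pointer  # refresh the cache on every good resolve
    return pointer, "registry"


V7 = {"version": "v7", "uri": "s3://models/recommender/v7"}
V8 = {"version": "v8", "uri": "s3://models/recommender/v8"}  # hypothetical later promotion


def replay() -> list:
    registry = FakeRegistry(up=False, production=V7)
    lkg_store = {"pointer": V7}  # persisted by an earlier healthy resolve
    events = []
    try:
        hard_startup(registry)
        events.append("hard: ready")
    except RegistryDown:
        events.append("hard: not ready")
    pointer, source = soft_startup(registry, lkg_store)
    events.append(f"soft: {pointer['version']} via {source}")
    # The registry recovers and a new version has been promoted meanwhile.
    registry.up, registry.production = True, V8
    pointer, source = soft_startup(registry, lkg_store)
    events.append(f"soft: {pointer['version']} via {source}")
    events.append(f"cache now {lkg_store['pointer']['version']}")
    return events
```

Calling replay() walks through the outage, the cached boot, and the refresh once the registry returns.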

Check yourself

The registry is down. Why do your existing replicas keep serving fine while a brand-new pod can't?

What makes the registry a soft dependency on the startup path?