Model serving dependency fallback

When the model your service calls goes dark, you don't go down with it — you degrade to something simpler that still answers.

The idea

A serving endpoint rarely works alone. To answer a request it might call a heavy primary model, a feature store, or an embedding service. Any of those can time out, throttle, or crash — and if your endpoint just waits, every request piles up behind the broken dependency.

A dependency fallback is a pre-planned, cheaper answer you return when the primary path fails: a smaller model, a cached prediction, or a safe default. The goal is graceful degradation: a slightly worse answer is still far better than no answer, and the failure stays contained instead of cascading.

How it works

Wrap the primary call in a short timeout and a circuit breaker. On failure, fall back through a ranked chain of cheaper options. The breaker trips after repeated failures so you stop hammering a dead dependency, then probes occasionally to recover.

def predict(request):
    # 1. Circuit breaker: if primary is already known-bad, skip straight to fallback.
    if breaker.is_open("primary"):
        return fallback_chain(request, reason="breaker_open")

    try:
        # 2. Bound the wait — a hung dependency must not hang us.
        result = primary_model.infer(request, timeout=0.15)
        breaker.record_success("primary")
        return result

    except (Timeout, Unavailable, RateLimited) as e:
        # 3. Record the failure so the breaker can trip after a threshold.
        breaker.record_failure("primary")
        # Use the exception type as the reason: str(e) can be empty or differ per request.
        return fallback_chain(request, reason=type(e).__name__)

def fallback_chain(request, reason):
    # Try the cheapest good answer first, then progressively safer defaults.
    cached = prediction_cache.get(request.key)
    if cached is not None:  # a falsy prediction (0, empty list) is still a valid hit
        return degraded(cached, source="cache", reason=reason)
    if small_model.healthy():
        try:
            return degraded(small_model.infer(request), source="small_model", reason=reason)
        except Exception:
            pass  # the small model can fail too; fall through to the safe default
    return degraded(safe_default(request), source="default", reason=reason)
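
The breaker object used above is never defined, so here is one minimal way it could look. This is a sketch under assumptions, not a production implementation: the class name, the threshold of 5 consecutive failures, the 10-second cooldown, and the injectable clock are all illustrative choices, and the class is not thread-safe.

```python
import time


class CircuitBreaker:
    """Per-dependency breaker: trips after N consecutive failures, probes after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value, tune per dependency
        self.cooldown_s = cooldown_s                # assumed value, tune per dependency
        self.clock = clock                          # injectable so tests can control time
        self._failures = {}    # name -> consecutive failure count
        self._opened_at = {}   # name -> time the breaker opened (or last allowed a probe)

    def is_open(self, name):
        opened_at = self._opened_at.get(name)
        if opened_at is None:
            return False  # closed: traffic flows normally
        if self.clock() - opened_at >= self.cooldown_s:
            # Cooldown elapsed: let this one caller through as a probe and restart
            # the cooldown so other callers keep skipping the dependency meanwhile.
            self._opened_at[name] = self.clock()
            return False
        return True

    def record_success(self, name):
        # Any success closes the breaker and resets the failure count.
        self._failures[name] = 0
        self._opened_at.pop(name, None)

    def record_failure(self, name):
        self._failures[name] = self._failures.get(name, 0) + 1
        if self._failures[name] >= self.failure_threshold:
            self._opened_at[name] = self.clock()
```

With breaker = CircuitBreaker(), the predict function above works unchanged. A failed probe calls record_failure again, which restarts the cooldown; a successful probe closes the breaker.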

Signals and trade-offs

Layer	Answer quality	When it serves
Primary model	Best	Healthy and within timeout
Cached prediction	Good, possibly stale	Primary failed, key seen before
Small model	Acceptable	No cache hit, small model healthy
Safe default	Minimal	Everything else is down

The trade-off is answer quality for availability: each step down the chain serves a worse prediction in exchange for still serving at all.
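
To make the layer that served each request visible, the degraded helper from the code above could tag the response and count it. This is a hypothetical sketch: the Response fields, the in-process Counter standing in for a real metrics client, and fallback_ratio are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any

# Stand-in for a real metrics client: counts responses per serving layer.
served_by = Counter()


@dataclass
class Response:
    prediction: Any
    source: str        # "primary", "cache", "small_model", or "default"
    is_degraded: bool
    reason: str = ""


def degraded(prediction, source, reason):
    """Wrap a fallback answer and record which layer produced it."""
    served_by[source] += 1
    return Response(prediction, source=source, is_degraded=True, reason=reason)


def fallback_ratio():
    """Share of counted responses that did not come from the primary."""
    total = sum(served_by.values())
    if total == 0:
        return 0.0
    return 1 - served_by["primary"] / total
```

For the ratio to mean anything, the primary path would also need to increment served_by["primary"] on success. Alerting on this ratio is one way to catch an outage that otherwise looks healthy.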

Watch out for

No timeout on the primary. Without a bound, one slow dependency exhausts your thread or connection pool and the whole endpoint stalls — the fallback never even runs.
A fallback that depends on the same thing. If your "small model" reads the same feature store that just died, your fallback dies with it. Fallbacks must fail independently.
Silent degradation. If you don't emit a metric like served_by="fallback", an outage can run for hours looking healthy while quality quietly tanks.
No circuit breaker. Retrying a dead dependency on every request adds load to something already failing and slows your own recovery. Trip the breaker, then probe.
Stale cache with no bound. Serving a cached prediction is fine; serving a week-old one as if it were fresh is not. Cap the age you'll accept.
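
One way to cap the age of cached predictions is to store a timestamp with each entry and treat anything older than the cap as a miss. The sketch below is illustrative only: the class name, the 300-second default, and the in-memory dict are assumptions, and a real deployment would more likely lean on the TTL support of whatever cache it already uses.

```python
import time


class BoundedAgeCache:
    """Prediction cache that refuses to serve entries older than max_age_s."""

    def __init__(self, max_age_s=300.0, clock=time.monotonic):
        self.max_age_s = max_age_s  # assumed cap, tune to how fast predictions go stale
        self.clock = clock          # injectable so tests can control time
        self._entries = {}          # key -> (stored_at, prediction)

    def put(self, key, prediction):
        self._entries[key] = (self.clock(), prediction)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None  # never seen this key
        stored_at, prediction = entry
        if self.clock() - stored_at > self.max_age_s:
            del self._entries[key]  # too stale to serve, so treat it as a miss
            return None
        return prediction
```

Because get returns None for both unknown and expired keys, a fallback chain like the one in How it works simply moves on to the next layer when the cached answer is too old.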

Worked example

A ranking service calls a primary recommender with a 150 ms timeout. During a deploy the primary starts returning 503s. The first few requests fail and fall back to a cached top-list; after five failures in a row the breaker opens, so subsequent requests skip the primary entirely and are answered from the cache or, on a cache miss, from the small popularity model in under 10 ms. A dashboard shows served_by="fallback" climbing to 100%, which pages the on-call. Two minutes later the deploy is rolled back, the breaker's probe succeeds, and traffic flows back to the primary. Users saw slightly less personalized results for two minutes instead of a blank page.

Check yourself

Your primary model starts timing out. What should the very first protective layer be?

Why must the fallback path avoid the primary's dependencies?