Model serving dependency fallback

When the model your service calls goes dark, you don't go down with it — you degrade to something simpler that still answers.

The idea

A serving endpoint rarely works alone. To answer a request it might call a heavy primary model, a feature store, or an embedding service. Any of those can time out, throttle, or crash — and if your endpoint just waits, every request piles up behind the broken dependency.
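
A rough way to see the pile-up is Little's law: requests in flight are roughly the arrival rate multiplied by how long each request is held open. The sketch below uses made-up numbers (200 requests per second, a 30 s hang) purely for illustration; `in_flight` is a hypothetical helper, not part of any library.

```python
def in_flight(arrival_rate_per_s: float, seconds_per_request: float) -> float:
    """Little's law: average number of requests held open at once."""
    return arrival_rate_per_s * seconds_per_request

# Assumed numbers, for illustration only.
rate = 200.0                         # requests per second
bounded = in_flight(rate, 0.15)      # each request waits at most 150 ms
unbounded = in_flight(rate, 30.0)    # each request hangs for 30 s

print(f"bounded wait:   {bounded:.0f} requests in flight")
print(f"unbounded wait: {unbounded:.0f} requests in flight")
```

Under these assumptions the same traffic holds about 200 times more connections, threads, or memory when the wait is unbounded, which is how one slow dependency can exhaust the caller.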

A dependency fallback is a pre-planned, cheaper answer you return when the primary path fails: a smaller model, a cached prediction, or a safe default. The goal is graceful degradation — slightly worse answers stay far better than no answers, and the failure stays contained instead of cascading.


How it works

Wrap the primary call in a short timeout and a circuit breaker. On failure, fall back through a ranked chain of cheaper options. The breaker trips (opens) after repeated failures so you stop hammering a dead dependency; after a cooldown it lets an occasional probe request through, and a successful probe closes it again.

def predict(request):
    # 1. Circuit breaker: if primary is already known-bad, skip straight to fallback.
    if breaker.is_open("primary"):
        return fallback_chain(request, reason="breaker_open")

    try:
        # 2. Bound the wait — a hung dependency must not hang us.
        result = primary_model.infer(request, timeout=0.15)
        breaker.record_success("primary")
        return result

    except (Timeout, Unavailable, RateLimited) as e:
        # 3. Record the failure so the breaker can trip after a threshold.
        breaker.record_failure("primary")
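        # Pass the exception class name, not the full message, so the reason stays a short, stable label.
        return fallback_chain(request, reason=type(e).__name__)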

def fallback_chain(request, reason):
    # Try cheapest-good first, then progressively safer defaults.
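    cached = prediction_cache.get(request.key)
    if cached is not None:  # a falsy prediction (0, an empty list) is still a valid hit
        return degraded(cached, source="cache", reason=reason)
    if small_model.healthy():
        try:
            return degraded(small_model.infer(request), source="small_model", reason=reason)
        except (Timeout, Unavailable, RateLimited):
            pass  # the fallback failed too: drop to the default instead of raising
    # Last resort: safe_default must not call anything that can fail.
    return degraded(safe_default(request), source="default", reason=reason)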
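
The `breaker` object used in `predict` is not defined in the snippet above. One way it might look is sketched here, assuming a consecutive-failure threshold and a fixed cooldown before a probe is allowed. The class name, the defaults (5 failures, 10 s), and the injectable `clock` are all assumptions made for illustration, and the sketch is not thread-safe.

```python
import time

class CircuitBreaker:
    """Per-dependency breaker: closed -> open after N straight failures -> probe after cooldown."""

    def __init__(self, failure_threshold=5, cooldown_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock        # injectable so tests can control time
        self._failures = {}       # name -> consecutive failure count
        self._opened_at = {}      # name -> time the breaker last tripped

    def is_open(self, name):
        opened_at = self._opened_at.get(name)
        if opened_at is None:
            return False          # closed: traffic flows normally
        if self.clock() - opened_at >= self.cooldown_s:
            # Cooldown elapsed: let this caller through as a probe and restart
            # the cooldown so other callers keep skipping the dependency.
            self._opened_at[name] = self.clock()
            return False
        return True

    def record_success(self, name):
        # Any success, including a probe, closes the breaker and resets the count.
        self._failures[name] = 0
        self._opened_at.pop(name, None)

    def record_failure(self, name):
        self._failures[name] = self._failures.get(name, 0) + 1
        if self._failures[name] >= self.failure_threshold:
            self._opened_at[name] = self.clock()   # trip, or re-trip after a failed probe


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=5, cooldown_s=10.0, clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure("primary")
print(breaker.is_open("primary"))   # True: five straight failures tripped it
now[0] = 10.0
print(breaker.is_open("primary"))   # False: cooldown elapsed, this caller is the probe
print(breaker.is_open("primary"))   # True: everyone else still skips the primary
breaker.record_success("primary")
print(breaker.is_open("primary"))   # False: the probe succeeded, breaker closed
```

Counting consecutive failures keeps the sketch short; a production breaker would more likely use a failure rate over a time window and guard its state with a lock.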

Signals and trade-offs

Layer              Answer quality        When it serves
Primary model      Best                  Healthy and within timeout
Cached prediction  Good, possibly stale  Primary failed, key seen before
Small model        Acceptable            No cache hit, small model healthy
Safe default       Minimal               Everything else is down

The trade-off is correctness for availability: each step down the chain serves a slightly worse prediction in exchange for still serving at all.
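
Because each layer trades away some quality, it helps if every response says which layer produced it. The sketch below is one possible shape for the `degraded` helper that the serving code calls. The `Prediction` fields and the in-memory `served_by` counter are assumptions standing in for whatever response type and metrics client a real service uses.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Prediction:
    value: Any
    source: str = "primary"     # which layer produced the answer
    is_degraded: bool = False
    reason: str = ""            # why the primary was skipped, if it was

served_by = Counter()           # stand-in for a real metrics client

def degraded(value, source, reason):
    """Wrap a fallback answer so callers and dashboards can see the downgrade."""
    served_by[source] += 1
    return Prediction(value=value, source=source, is_degraded=True, reason=reason)

p = degraded(["top-1", "top-2"], source="cache", reason="Timeout")
print(p.source, p.is_degraded, served_by["cache"])
```

Tagging the response lets downstream code decide how to treat a degraded answer (for example, not caching it), and the per-source counts give the signal needed to notice when fallbacks are carrying the traffic.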

Watch out for

Worked example

A ranking service calls a primary recommender with a 150 ms timeout. During a deploy the primary starts returning 503s. The first few requests fail and fall back to a cached top-list; after five failures in a row the breaker opens, so subsequent requests skip the primary entirely and are answered from the cache or, on a cache miss, from the small popularity model in under 10 ms. A dashboard shows served_by="fallback" climbing to 100%, and an alert on that rate pages the on-call. Two minutes later the deploy is rolled back, the breaker's probe succeeds, and traffic flows back to the primary. Users saw slightly less personalized results for two minutes instead of a blank page.
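
The paging signal in this example could come from a fallback-rate check along the following lines. This is a minimal sketch: the class name, the count-based window of 100 requests, and the 50% paging threshold are all assumptions, and a real system would more likely compute the rate in its metrics backend over a time window.

```python
from collections import deque

class FallbackRateMonitor:
    """Tracks the share of recent requests served by any fallback layer."""

    def __init__(self, window=100, page_threshold=0.5):
        self.recent = deque(maxlen=window)   # True = request was served by a fallback
        self.page_threshold = page_threshold

    def observe(self, source):
        # Anything other than the primary counts as a fallback.
        self.recent.append(source != "primary")

    def fallback_rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def should_page(self):
        return self.fallback_rate() >= self.page_threshold


monitor = FallbackRateMonitor(window=100, page_threshold=0.5)
for _ in range(100):
    monitor.observe("primary")        # healthy traffic before the deploy
print(monitor.fallback_rate(), monitor.should_page())

for _ in range(100):
    monitor.observe("small_model")    # breaker open: every request falls back
print(monitor.fallback_rate(), monitor.should_page())
```

The `deque` with `maxlen` drops the oldest observation automatically, so the rate always reflects only the most recent window of requests.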

Check yourself

Your primary model starts timing out. What should the very first protective layer be?

Why must the fallback path avoid the primary's dependencies?