When the model your service calls goes dark, you don't go down with it — you degrade to something simpler that still answers.
A serving endpoint rarely works alone. To answer a request it might call a heavy primary model, a feature store, or an embedding service. Any of those can time out, throttle, or crash — and if your endpoint just waits, every request piles up behind the broken dependency.
A dependency fallback is a pre-planned, cheaper answer you return when the primary path fails: a smaller model, a cached prediction, or a safe default. The goal is graceful degradation: a slightly worse answer is far better than no answer, and the failure stays contained instead of cascading.
Wrap the primary call in a short timeout and a circuit breaker. On failure, fall back through a ranked chain of cheaper options. The breaker trips after repeated failures so you stop hammering a dead dependency, then probes occasionally to recover.

    def predict(request):
        # 1. Circuit breaker: if primary is already known-bad, skip straight to fallback.
        if breaker.is_open("primary"):
            return fallback_chain(request, reason="breaker_open")
        try:
            # 2. Bound the wait: a hung dependency must not hang us.
            result = primary_model.infer(request, timeout=0.15)
            breaker.record_success("primary")
            return result
        except (Timeout, Unavailable, RateLimited) as e:
            # 3. Record the failure so the breaker can trip after a threshold.
            breaker.record_failure("primary")
            return fallback_chain(request, reason=str(e))

    def fallback_chain(request, reason):
        # Try cheapest-good first, then progressively safer defaults.
        # Compare against None so a legitimate falsy prediction (0, []) still counts as a hit.
        if (cached := prediction_cache.get(request.key)) is not None:
            return degraded(cached, source="cache", reason=reason)
        if small_model.healthy():
            try:
                return degraded(small_model.infer(request), source="small_model", reason=reason)
            except Exception:
                # The small model can fail too; fall through to the default rather than raise.
                pass
        return degraded(safe_default(request), source="default", reason=reason)

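The `breaker` object used in `predict` is never defined in this lesson. As a rough illustration of the trip-then-probe behavior described earlier, here is a minimal sketch. The class name `CircuitBreaker`, the parameters `failure_threshold` and `cooldown_s`, and the injectable `clock` are all assumptions made for this example, and the sketch is not thread-safe. A production service would more likely use an existing resilience library.

```python
import time


class CircuitBreaker:
    """Hypothetical per-dependency breaker.

    Closed until `failure_threshold` consecutive failures, then open.
    Once `cooldown_s` has elapsed, a single probe call is let through.
    """

    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock        # injectable so tests can control time
        self._failures = {}       # name -> consecutive failure count
        self._opened_at = {}      # name -> time the breaker last opened

    def is_open(self, name):
        opened_at = self._opened_at.get(name)
        if opened_at is None:
            return False  # closed: calls go to the dependency
        if self.clock() - opened_at >= self.cooldown_s:
            # Cooldown elapsed: allow one probe, and restart the cooldown
            # so other callers keep falling back while the probe runs.
            self._opened_at[name] = self.clock()
            return False
        return True

    def record_success(self, name):
        # Any success closes the breaker and clears the failure streak.
        self._failures[name] = 0
        self._opened_at.pop(name, None)

    def record_failure(self, name):
        count = self._failures.get(name, 0) + 1
        self._failures[name] = count
        if count >= self.failure_threshold:
            self._opened_at[name] = self.clock()
```

With `breaker = CircuitBreaker()` at module level, `predict` can call `is_open`, `record_success`, and `record_failure` exactly as written above. If a probe fails, `record_failure` reopens the breaker for another cooldown period.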
| Layer | Answer quality | When it serves |
|---|---|---|
| Primary model | Best | Healthy and within timeout |
| Cached prediction | Good, possibly stale | Primary failed, key seen before |
| Small model | Acceptable | No cache hit, small model healthy |
| Safe default | Minimal | Everything else is down |
The trade-off is correctness for availability: each step down the chain serves a slightly worse prediction in exchange for still serving at all.
Without a signal such as served_by="fallback", an outage can run for hours looking healthy while quality quietly tanks.
A ranking service calls a primary recommender with a 150 ms timeout. During a deploy the primary starts returning 503s. The first few requests fail and fall back to a cached top-list; after five failures in a row the breaker opens, so the next requests skip the primary entirely and answer from the small popularity model in under 10 ms. A dashboard shows served_by="fallback" climbing to 100%, paging the on-call. Two minutes later the deploy is rolled back, the breaker's probe succeeds, and traffic flows back to the primary. Users saw slightly less personalized results for two minutes instead of a blank page.
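The served_by signal in this scenario has to be emitted somewhere. One possible place is the `degraded` helper that the fallback chain already calls. The sketch below is an assumption about how that helper might look: the `Response` dataclass, the `primary_response` and `fallback_ratio` helpers, and the in-process `Counter` standing in for a real metrics client are all hypothetical names invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any

# Stand-in for a real metrics client; keys play the role of served_by labels.
served_by = Counter()


@dataclass
class Response:
    prediction: Any
    source: str          # "primary", "cache", "small_model", or "default"
    is_degraded: bool
    reason: str = ""


def degraded(prediction, source, reason):
    """Wrap a fallback answer and count it, so degradation is visible."""
    served_by["fallback"] += 1
    served_by[f"fallback:{source}"] += 1
    return Response(prediction, source, True, reason)


def primary_response(prediction):
    """Wrap a healthy primary answer and count it."""
    served_by["primary"] += 1
    return Response(prediction, "primary", False)


def fallback_ratio():
    """Share of responses served by any fallback layer (0.0 if no traffic)."""
    total = served_by["primary"] + served_by["fallback"]
    return served_by["fallback"] / total if total else 0.0
```

An alert on `fallback_ratio()` crossing some threshold is what would page the on-call in the scenario above. Carrying `source` and `reason` on the response also lets callers or logs tell which layer answered and why.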
Your primary model starts timing out. What should the very first protective layer be?
Why must the fallback path avoid the primary's dependencies?