Model serving: stale recommendation cache

A cache makes recommendations cheap and fast — until the model moves on and the cache keeps handing out yesterday's answer.

The idea

Recommendation lists are expensive to compute: a model scores thousands of candidate items per user, then ranks them. Doing that on every page load would be slow and costly, so the service caches each user's list behind a time-to-live (TTL). A cache hit returns the list instantly; a miss recomputes it and stores the result for the next reader.

The hazard is staleness. When the model is retrained to a new version, or a user's behaviour changes — they just bought the very thing you keep recommending — the cache happily serves the old list until the TTL expires or something invalidates it. Latency looks great the whole time, so a quietly degrading experience can hide behind a healthy dashboard.


How it works

Build the cache key from the user and the model version, so a new model version is automatically a fresh namespace — no old entry can answer a new model's request. On a hit you return instantly, accepting staleness bounded by the TTL. On a miss you recompute and store. And when a user takes an action that obviously changes their needs, invalidate the key right away instead of waiting out the TTL.

MODEL_VERSION = "v2"          # bumped on every retrain / redeploy
TTL_SECONDS = 600             # 10 min for an average user

# `cache` and `model` are the service's cache client and ranking model.

def recs_key(user):
    # Model version is part of the key: v2 can never read v1's cached list.
    # Reads and invalidation share this helper so they always agree on the key.
    return f"recs:{user.id}:{MODEL_VERSION}"

def get_recommendations(user):
    key = recs_key(user)

    if (hit := cache.get(key)) is not None:
        # Fast path. Staleness is bounded by the TTL we set on write.
        return hit                       # served from cache

    # Miss: pay the expensive model once, then cache for the next reader.
    recs = model.score_and_rank(user)    # the costly pipeline
    cache.set(key, recs, ttl=TTL_SECONDS)
    return recs

def on_purchase(user, item):
    # The user just bought something, so their list is now obviously stale.
    # Invalidate immediately instead of waiting for the TTL to lapse.
    cache.delete(recs_key(user))         # next read recomputes fresh
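
The snippet above leaves the cache object itself unspecified. To try it locally, a minimal in-memory stand-in could look like the sketch below. The class name TTLCache and the injectable clock are assumptions made for illustration; a production service would more likely point at a shared store, and the real client's method signatures may differ.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (illustrative stand-in)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so tests can control time
        self._store = {}         # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (self._clock() + ttl, value)

    def delete(self, key):
        # Deleting a missing key is a no-op, so invalidation is always safe.
        self._store.pop(key, None)


# Drive the cache with a fake clock to see the TTL bound on staleness.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.set("recs:42:v2", ["item-a", "item-b"], ttl=600)
hit_within_ttl = cache.get("recs:42:v2")     # still cached
now[0] = 601.0
hit_after_ttl = cache.get("recs:42:v2")      # expired, so a miss
```

The fake clock is only there to make expiry observable without sleeping; with the default clock the class behaves the same way against real elapsed time.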

Signals and trade-offs

Strategy               | Freshness                     | Latency & cost
Long TTL               | Weak, stale for ages          | Lowest latency, lowest cost
Short TTL              | Better, still windowed        | More recomputes, higher cost
Model version in key   | New model is instantly fresh  | One cold recompute per version
Event invalidation     | Strong on the events you wire | Cheap, but only as good as your events
Stale-while-revalidate | Good, fresh within one cycle  | Fast reads, recompute off the path
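
Stale-while-revalidate is the one strategy in the table that the code earlier does not show, so here is a rough sketch of the idea: a stale entry is still returned immediately, and a recompute is queued off the read path. The class SWRCache, its parameters, and the run_async hook are all hypothetical names chosen for this example, and the sketch omits a hard expiry that a real service would probably want.

```python
import threading
import time


class SWRCache:
    """Stale-while-revalidate sketch: serve what we have, refresh off the read path."""

    def __init__(self, compute, fresh_for, clock=time.monotonic, run_async=None):
        self._compute = compute            # key -> value, the expensive call
        self._fresh_for = fresh_for        # seconds an entry counts as fresh
        self._clock = clock
        self._run_async = run_async or self._in_thread
        self._store = {}                   # key -> (fresh_until, value)
        self._refreshing = set()           # keys with a refresh in flight
        self._lock = threading.Lock()

    @staticmethod
    def _in_thread(fn):
        threading.Thread(target=fn, daemon=True).start()

    def _put(self, key, value):
        with self._lock:
            self._store[key] = (self._clock() + self._fresh_for, value)

    def _refresh(self, key):
        try:
            self._put(key, self._compute(key))
        finally:
            with self._lock:
                self._refreshing.discard(key)

    def get(self, key):
        start_refresh = False
        with self._lock:
            entry = self._store.get(key)
            if entry is not None:
                fresh_until, _ = entry
                if self._clock() >= fresh_until and key not in self._refreshing:
                    self._refreshing.add(key)
                    start_refresh = True
        if entry is None:
            # Cold miss: nothing to serve, so compute inline once.
            value = self._compute(key)
            self._put(key, value)
            return value
        if start_refresh:
            # Stale hit: queue a recompute, but do not wait for it.
            self._run_async(lambda: self._refresh(key))
        return entry[1]
```

A reader who arrives just after the entry goes stale still gets the old list; the refreshed list is visible from the next read after the background recompute finishes, which is what "fresh within one cycle" means in the table. Concurrent cold misses may each compute once in this sketch, since only stale refreshes are deduplicated.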

The trade-off is freshness for latency and cost: every step toward fresher answers spends more recomputes, so you tune the TTL to how fast each user's needs actually move.
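
One way to tune the TTL per user is to derive it from how active the user has been recently. The sketch below is an assumption-heavy illustration: the function ttl_for, the events_last_hour signal, and every threshold in it are invented for the example and would need to be calibrated against real traffic.

```python
def ttl_for(events_last_hour, base_ttl=600, min_ttl=60, max_ttl=3600):
    """Pick a TTL in seconds: shorter for active users, longer for idle ones.

    All defaults are illustrative, not recommended values.
    """
    if events_last_hour <= 0:
        # Idle user: their needs are unlikely to have moved, so cache longer.
        return max_ttl
    # More recent activity shrinks the TTL, clamped to a sane range.
    ttl = base_ttl // events_last_hour
    return max(min_ttl, min(max_ttl, ttl))


idle_ttl = ttl_for(0)        # 3600
average_ttl = ttl_for(1)     # 600
busy_ttl = ttl_for(5)        # 120
very_busy_ttl = ttl_for(100) # clamped to 60
```

The lower clamp matters for cost: without it, a very active user would force a recompute on nearly every request.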

Watch out for

Worked example

A recommendation service caches each user's list with a 10 min TTL under the key recs:{user.id}. The team retrains and ships v2. Latency stays flat and the on-call sees nothing, but CTR drops 6% overnight. Triage shows reads are still served from cache; the cached lists were all computed by v1, because the key never included the model version, so v2 kept reading v1's entries. To contain it they shorten the TTL and purge the namespace, forcing fresh recomputes. The root-cause fix changes the key to recs:{user.id}:{MODEL_VERSION} so the next redeploy lands in a clean namespace and a stale model can never answer a new model's request. They also add on_purchase invalidation, so a user who buys an item stops seeing it recommended within seconds instead of ten minutes.
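
The key bug in this example can be reproduced in a few lines. The sketch below uses a plain dict as the cache and a string as a stand-in for the model's output; make_service and get_recs are hypothetical names, and the TTL is left out on purpose so that only the effect of the key shape is visible.

```python
def make_service(versioned):
    """Build a toy recommendation reader with or without the version in the key."""
    cache = {}

    def get_recs(user_id, model_version):
        if versioned:
            key = f"recs:{user_id}:{model_version}"
        else:
            key = f"recs:{user_id}"          # the buggy key from the incident
        if key in cache:
            return cache[key]
        recs = f"list computed by {model_version}"   # stand-in for the model
        cache[key] = recs
        return recs

    return get_recs


buggy = make_service(versioned=False)
buggy(42, "v1")                      # warm the cache under v1
buggy_after_deploy = buggy(42, "v2") # v2 reads v1's entry

fixed = make_service(versioned=True)
fixed(42, "v1")
fixed_after_deploy = fixed(42, "v2") # new namespace, so a cold recompute

print(buggy_after_deploy)   # list computed by v1
print(fixed_after_deploy)   # list computed by v2
```

With the version in the key, the old v1 entries are never read again and simply age out under their TTL.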

Check yourself

After a redeploy to v2, users still see v1's recommendations even though latency looks perfect. What's the most likely cause?

A user just bought the item you keep recommending. What's the cleanest way to stop showing it without waiting out the TTL?