Model serving: the stale recommendation cache

A cache makes recommendations cheap and fast — until the model moves on and the cache keeps handing out yesterday's answer.

The idea

Recommendation lists are expensive to compute: a model scores thousands of candidate items per user, then ranks them. Doing that on every page load would be slow and costly, so the service caches each user's list behind a time-to-live (TTL). A cache hit returns the list instantly; a miss recomputes it and stores the result for the next reader.

The hazard is staleness. When the model is retrained to a new version, or a user's behaviour changes — they just bought the very thing you keep recommending — the cache happily serves the old list until the TTL expires or something invalidates it. Latency looks great the whole time, so a quietly degrading experience can hide behind a healthy dashboard.

How it works

Build the cache key from the user and the model version, so a new model version is automatically a fresh namespace — no old entry can answer a new model's request. On a hit you return instantly, accepting staleness bounded by the TTL. On a miss you recompute and store. And when a user takes an action that obviously changes their needs, invalidate the key right away instead of waiting out the TTL.

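# `cache` and `model` are assumed to be clients supplied by the serving stack;
# any cache offering get / set-with-TTL / delete fits this sketch.
MODEL_VERSION = "v2"          # bumped on every retrain / redeploy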

def get_recommendations(user):
    # Model version is part of the key: v2 can never read v1's cached list.
    key = f"recs:{user.id}:{MODEL_VERSION}"

    if (hit := cache.get(key)) is not None:
        # Fast path. Staleness is bounded by the TTL we set on write.
        return hit                       # served from cache

    # Miss: pay the expensive model once, then cache for the next reader.
    recs = model.score_and_rank(user)    # the costly pipeline
    cache.set(key, recs, ttl=600)        # 10 min for an average user
    return recs

def on_purchase(user, item):
    # The user just bought something — their list is now obviously stale.
    # Invalidate immediately instead of waiting for the TTL to lapse.
    key = f"recs:{user.id}:{MODEL_VERSION}"
    cache.delete(key)                    # next read recomputes fresh
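
The snippet above leaves the cache object undefined. As a rough stand-in for trying it locally, the sketch below implements the three calls it relies on (get, set with a ttl, delete) in memory. The class name TTLCache and the injectable clock are assumptions made for illustration; a production service would more likely sit on a shared cache server than a per-process dict.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (illustrative stand-in)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry can be exercised without sleeping
        self._entries = {}       # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None          # never stored, or already deleted
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value, ttl):
        # ttl is in seconds, matching cache.set(key, recs, ttl=600) above.
        self._entries[key] = (value, self._clock() + ttl)

    def delete(self, key):
        self._entries.pop(key, None)  # deleting a missing key is a no-op


# Usage with a fake clock, so the TTL can be "waited out" instantly.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.set("recs:42:v2", ["book", "lamp"], ttl=600)
within_ttl = cache.get("recs:42:v2")   # hit: still inside the 10 min window
now[0] = 601.0
after_ttl = cache.get("recs:42:v2")    # miss: the entry has expired
```

Expiry is checked lazily on read here, which keeps the sketch short but means expired entries that are never read again stay in memory.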

Signals and trade-offs

Strategy	Freshness	Latency & cost
Long TTL	Weak — stale for ages	Lowest latency, lowest cost
Short TTL	Better, still windowed	More recomputes, higher cost
Model version in key	New model is instantly fresh	One cold recompute per version
Event invalidation	Strong on the events you wire	Cheap, but only as good as your events
Stale-while-revalidate	Good — fresh within one cycle	Fast reads, recompute off the path

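The trade-off is freshness for latency and cost: fresher answers generally mean more recomputes, so tune the TTL to how fast each user's needs actually move. Versioned keys and event invalidation are the cheap exceptions, because they pay for a recompute only when something has actually changed instead of shortening the TTL for everyone.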
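
Stale-while-revalidate appears in the table but not in the code, so here is one way it might look. This is a minimal sketch under stated assumptions: the class name SWRCache, the fresh_for and stale_for parameters, and the injectable clock and schedule hooks are all invented for illustration. A read inside the fresh window is a plain hit; a read inside the stale window returns the old list immediately and queues a single refresh off the request path; anything older is recomputed synchronously.

```python
import threading
import time


class SWRCache:
    """Stale-while-revalidate sketch: serve stale now, refresh in the background."""

    def __init__(self, compute, fresh_for, stale_for,
                 clock=time.monotonic, schedule=None):
        self._compute = compute          # the expensive scoring call, keyed by cache key
        self._fresh_for = fresh_for      # seconds an entry counts as fresh
        self._stale_for = stale_for      # extra seconds it may still be served stale
        self._clock = clock
        self._schedule = schedule or self._run_in_thread
        self._entries = {}               # key -> (value, stored_at)
        self._refreshing = set()         # keys with a refresh already queued
        self._lock = threading.Lock()

    @staticmethod
    def _run_in_thread(task):
        threading.Thread(target=task, daemon=True).start()

    def _store(self, key):
        try:
            value = self._compute(key)
            with self._lock:
                self._entries[key] = (value, self._clock())
            return value
        finally:
            with self._lock:
                self._refreshing.discard(key)

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            age = None if entry is None else self._clock() - entry[1]
            if entry is not None and age < self._fresh_for:
                return entry[0]                       # fresh hit
            serve_stale = (entry is not None
                           and age < self._fresh_for + self._stale_for)
            start_refresh = serve_stale and key not in self._refreshing
            if start_refresh:
                self._refreshing.add(key)
        if serve_stale:
            if start_refresh:
                self._schedule(lambda: self._store(key))  # off the request path
            return entry[0]                           # stale, but instant
        # Cold miss or too old to serve. Concurrent cold misses are not
        # deduplicated in this sketch.
        return self._store(key)


# Usage with a fake clock and a manual scheduler so each step is visible.
now = [0.0]
calls = []
pending = []

def score_and_rank(key):
    calls.append(key)
    return f"list-{len(calls)}"

swr = SWRCache(score_and_rank, fresh_for=60, stale_for=600,
               clock=lambda: now[0], schedule=pending.append)
first = swr.get("recs:42:v2")    # cold miss: computes "list-1"
now[0] = 120.0
second = swr.get("recs:42:v2")   # stale: returns "list-1", queues one refresh
pending.pop()()                  # the background refresh runs
third = swr.get("recs:42:v2")    # fresh again: "list-2"
```

The reader who triggers the refresh still gets the old list, which is why the table rates this strategy as fresh within one cycle and not immediately.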

Watch out for

Model version missing from the key. If the key is just recs:{user.id}, a redeploy to v2 keeps reading v1's cached lists until every TTL lapses: a silent stale window across all users that lasts as long as your longest TTL.
No invalidation on user actions. A purchase, unfollow, or "not interested" obviously changes a list, but if nothing deletes the key the cache keeps recommending the item they just bought.
Unbounded TTL. A cache entry with no expiry is fine — until it quietly becomes the oldest data in your system. Always cap how stale you'll serve.
Thundering herd on a hot key. When a popular user's entry expires, every concurrent request misses at once and stampedes the model. Use a lock or stale-while-revalidate so only one recompute runs.
Reading low latency as healthy. Hit rate and p99 can look perfect while click-through quietly drops. Watch a quality metric — CTR or freshness age — alongside latency.
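
The lock mentioned under the thundering-herd item could be sketched as below: a per-key lock with a second cache check inside it, so concurrent misses on the same key collapse into one recompute. The names get_or_compute and slow_score_and_rank are hypothetical, the cache is a plain dict with no TTL to keep the example short, and a multi-process deployment would need a distributed lock instead of threading.Lock.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

_cache = {}                      # key -> cached list (no TTL in this sketch)
_key_locks = {}                  # key -> lock guarding recomputes of that key
_locks_guard = threading.Lock()  # protects the _key_locks dict itself


def _lock_for(key):
    with _locks_guard:
        return _key_locks.setdefault(key, threading.Lock())


def get_or_compute(key, compute):
    value = _cache.get(key)
    if value is not None:
        return value                 # fast path: no locking on a hit
    with _lock_for(key):
        value = _cache.get(key)      # re-check: another thread may have filled it
        if value is None:
            value = compute()        # only the first thread through pays this
            _cache[key] = value
        return value


# Usage: eight concurrent requests for the same hot key.
calls = []

def slow_score_and_rank():
    calls.append(1)
    time.sleep(0.05)                 # stand-in for the expensive model call
    return ["book", "lamp"]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(
        lambda _: get_or_compute("recs:42:v2", slow_score_and_rank),
        range(8),
    ))
```

The waiting requests still block until the first recompute finishes, so this protects the model but not their latency.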

Worked example

A recommendation service caches each user's list with a 10 min TTL under the key recs:{user.id}. The team retrains and ships v2. Latency stays flat and the on-call sees nothing — but CTR drops 6% overnight. Triage shows reads are still served from cache; the cached lists were all computed by v1, because the key never included the model version, so v2 kept reading v1's entries. To contain it they drop the TTL and purge the namespace, forcing fresh recomputes. The root-cause fix changes the key to recs:{user.id}:{MODEL_VERSION} so the next redeploy lands in a clean namespace and a stale model can never answer a new model's request. They also add on_purchase invalidation, so a user who buys an item stops seeing it recommended within seconds instead of ten minutes.
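
The root cause can be reproduced in a few lines. The sketch below is a toy model of the incident, not the team's actual service: make_service is a hypothetical helper, the cache is a plain dict, and the TTL is left out so the comparison only covers reads that arrive before an entry would have expired.

```python
def make_service(key_for):
    """Build a toy recommendation service around a given key scheme."""
    cache = {}
    state = {"model_version": "v1"}

    def get_recommendations(user_id):
        key = key_for(user_id, state["model_version"])
        if key not in cache:
            # Miss: "score" with whichever model version is live right now.
            cache[key] = f"list for user {user_id} from {state['model_version']}"
        return cache[key]

    return state, get_recommendations


# Same service, two key schemes.
unversioned_state, unversioned_get = make_service(
    lambda user_id, version: f"recs:{user_id}")
versioned_state, versioned_get = make_service(
    lambda user_id, version: f"recs:{user_id}:{version}")

for state, get in ((unversioned_state, unversioned_get),
                   (versioned_state, versioned_get)):
    get(42)                          # warm the cache while v1 is live
    state["model_version"] = "v2"    # ship v2

stale = unversioned_get(42)   # "list for user 42 from v1"
fresh = versioned_get(42)     # "list for user 42 from v2"
```

With the version in the key, the first read after the deploy is a miss by construction, which is the one cold recompute per version noted in the trade-offs table.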

Check yourself

After a redeploy to v2, users still see v1's recommendations even though latency looks perfect. What's the most likely cause?

A user just bought the item you keep recommending. What's the cleanest way to stop showing it without waiting out the TTL?