A cache makes recommendations cheap and fast — until the model moves on and the cache keeps handing out yesterday's answer.
Recommendation lists are expensive to compute: a model scores thousands of candidate items per user, then ranks them. Doing that on every page load would be slow and costly, so the service caches each user's list behind a time-to-live (TTL). A cache hit returns the list instantly; a miss recomputes it and stores the result for the next reader.
The hazard is staleness. When the model is retrained to a new version, or a user's behaviour changes — they just bought the very thing you keep recommending — the cache happily serves the old list until the TTL expires or something invalidates it. Latency looks great the whole time, so a quietly degrading experience can hide behind a healthy dashboard.
Build the cache key from the user and the model version, so a new model version is automatically a fresh namespace — no old entry can answer a new model's request. On a hit you return instantly, accepting staleness bounded by the TTL. On a miss you recompute and store. And when a user takes an action that obviously changes their needs, invalidate the key right away instead of waiting out the TTL.
```python
MODEL_VERSION = "v2"  # bumped on every retrain / redeploy


def get_recommendations(user):
    # Model version is part of the key: v2 can never read v1's cached list.
    key = f"recs:{user.id}:{MODEL_VERSION}"
    if (hit := cache.get(key)) is not None:
        # Fast path. Staleness is bounded by the TTL we set on write.
        return hit  # served from cache
    # Miss: pay the expensive model once, then cache for the next reader.
    recs = model.score_and_rank(user)  # the costly pipeline
    cache.set(key, recs, ttl=600)  # 10 min for an average user
    return recs


def on_purchase(user, item):
    # The user just bought something, so their list is now obviously stale.
    # Invalidate immediately instead of waiting for the TTL to lapse.
    key = f"recs:{user.id}:{MODEL_VERSION}"
    cache.delete(key)  # next read recomputes fresh
```
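The `ttl=600` above is described as a value for an average user, which suggests the TTL could vary per user. A minimal sketch of one way to do that follows. The function name `ttl_for`, the `events_last_hour` signal, and the halving rule are all illustrative assumptions, not something the service is known to use.

```python
def ttl_for(events_last_hour: int, base_ttl: int = 600, min_ttl: int = 60) -> int:
    """Pick a TTL in seconds: busier users get shorter TTLs (hypothetical rule).

    The base TTL is halved once for every 5 recent events and is never
    allowed to drop below min_ttl.
    """
    halvings = max(0, events_last_hour) // 5
    return max(min_ttl, base_ttl >> halvings)
```

A caller would then pass the result to the cache write, for example `cache.set(key, recs, ttl=ttl_for(n))`, where `n` is whatever activity count the service already tracks.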
| Strategy | Freshness | Latency & cost |
|---|---|---|
| Long TTL | Weak — stale for ages | Lowest latency, lowest cost |
| Short TTL | Better, still windowed | More recomputes, higher cost |
| Model version in key | New model is instantly fresh | One cold recompute per version |
| Event invalidation | Strong on the events you wire | Cheap, but only as good as your events |
| Stale-while-revalidate | Good — fresh within one cycle | Fast reads, recompute off the path |
The trade-off is freshness for latency and cost: every step toward fresher answers spends more recomputes, so you tune the TTL to how fast each user's needs actually move.
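Stale-while-revalidate is the one strategy in the table that the earlier snippet does not show. The sketch below is one possible in-process version, under stated assumptions: `SWRCache`, `compute`, `fresh_for`, and `clock` are hypothetical names, each refresh runs on its own daemon thread, and entries never hard-expire. A production cache would likely add a hard TTL and a bounded worker pool.

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple


class SWRCache:
    """Serve the cached value immediately; refresh stale entries off the read path."""

    def __init__(self, compute: Callable[[str], Any], fresh_for: float,
                 clock: Callable[[], float] = time.monotonic):
        self._compute = compute          # the expensive pipeline (assumed callable)
        self._fresh_for = fresh_for      # seconds an entry counts as fresh
        self._clock = clock              # injectable so tests can control time
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (value, stored_at)
        self._refreshing: set = set()    # keys with a refresh already in flight
        self._lock = threading.Lock()

    def _store(self, key: str) -> Any:
        try:
            value = self._compute(key)
            with self._lock:
                self._entries[key] = (value, self._clock())
        finally:
            with self._lock:
                self._refreshing.discard(key)
        return value

    def get(self, key: str) -> Any:
        start_refresh = False
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                age = self._clock() - entry[1]
                if age >= self._fresh_for and key not in self._refreshing:
                    self._refreshing.add(key)   # at most one refresh per key
                    start_refresh = True
        if entry is None:
            return self._store(key)  # cold miss: nothing to serve, compute inline
        if start_refresh:
            threading.Thread(target=self._store, args=(key,), daemon=True).start()
        return entry[0]  # fresh or stale, the reader never waits on the model
```

The reader who triggers the refresh still gets the old list; the next reader after the refresh completes gets the new one, which is the "fresh within one cycle" behaviour the table describes.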
With the key recs:{user.id}, a redeploy to v2 keeps reading v1's cached lists until every TTL lapses: a silent, hours-long stale window across all users.

A recommendation service caches each user's list with a 10 min TTL under the key recs:{user.id}. The team retrains and ships v2. Latency stays flat and the on-call sees nothing, but CTR drops 6% overnight. Triage shows reads are still served from cache; the cached lists were all computed by v1, because the key never included the model version, so v2 kept reading v1's entries.

To contain it they drop the TTL and purge the namespace, forcing fresh recomputes. The root-cause fix changes the key to recs:{user.id}:{MODEL_VERSION} so the next redeploy lands in a clean namespace and a stale model can never answer a new model's request. They also add on_purchase invalidation, so a user who buys an item stops seeing it recommended within seconds instead of ten minutes.
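The incident can be reproduced in a few lines. This is a toy sketch, assuming a plain dict stands in for the cache and that no entry has expired yet; `serve`, `purge_namespace`, `model_v1`, and `model_v2` are hypothetical helpers invented for the illustration.

```python
from typing import Callable, Dict, List


def serve(cache: Dict[str, List[str]], key: str,
          model: Callable[[], List[str]]) -> List[str]:
    """Read-through lookup: return the cached list, or compute and store it."""
    if key not in cache:
        cache[key] = model()
    return cache[key]


def purge_namespace(cache: Dict[str, List[str]], prefix: str) -> None:
    """Containment step: drop every entry whose key starts with prefix."""
    for key in [k for k in cache if k.startswith(prefix)]:
        del cache[key]


def model_v1() -> List[str]:
    return ["v1-item-a", "v1-item-b"]


def model_v2() -> List[str]:
    return ["v2-item-x", "v2-item-y"]


# The incident: the key has no model version, so v2 reads v1's entry.
unversioned: Dict[str, List[str]] = {}
serve(unversioned, "recs:42", model_v1)
stale_after_redeploy = serve(unversioned, "recs:42", model_v2)

# Containment: purge the namespace so the next read recomputes with v2.
purge_namespace(unversioned, "recs:")
fresh_after_purge = serve(unversioned, "recs:42", model_v2)

# Root-cause fix: the version is part of the key, so v2 starts in a clean namespace.
versioned: Dict[str, List[str]] = {}
serve(versioned, "recs:42:v1", model_v1)
fresh_after_redeploy = serve(versioned, "recs:42:v2", model_v2)
```

Here `stale_after_redeploy` holds v1's list even though v2 was asked, while `fresh_after_purge` and `fresh_after_redeploy` both hold v2's list.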
After a redeploy to v2, users still see v1's recommendations even though latency looks perfect. What's the most likely cause?
A user just bought the item you keep recommending. What's the cleanest way to stop showing it without waiting out the TTL?