Code Room
On-call · Medium
Question
Your personalization service fetches precomputed user and item embeddings from an embedding store (a cache in front of a database) before scoring. At 18:00, p99 latency on the embedding-fetch step jumps from 15 ms to 600 ms and the backing DB shows a read-QPS spike. Dashboards show the embedding cache hit rate falling from 98% to 60% at 18:00. The job that regenerates and bulk-reuploads all item embeddings, which normally runs nightly, ran early today at 18:00 and rewrote every item key with new values, and the cache evicted or invalidated those entries en masse. There was no app deploy. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
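One common way to stop the bleeding in a mass-invalidation incident like this is to coalesce concurrent misses for the same key, so that only one request per key reaches the database while the cache refills. The sketch below is a minimal in-process illustration of that idea, not a description of the actual embedding store: the class name `SingleFlightCache` and the `load` callback are hypothetical, and a real service would also need TTLs, size limits, and a load-shedding or stale-serving policy.

```python
import threading
from typing import Any, Callable, Dict, Hashable


class SingleFlightCache:
    """Read-through cache that lets only one caller per key hit the backend.

    Hypothetical sketch: `load` stands in for the database read of one embedding.
    """

    def __init__(self, load: Callable[[Hashable], Any]) -> None:
        self._load = load
        self._values: Dict[Hashable, Any] = {}
        self._inflight: Dict[Hashable, threading.Event] = {}
        self._lock = threading.Lock()

    def get(self, key: Hashable) -> Any:
        with self._lock:
            if key in self._values:
                return self._values[key]  # cache hit
            event = self._inflight.get(key)
            leader = event is None
            if leader:
                # First miss for this key: this caller does the backend read.
                event = threading.Event()
                self._inflight[key] = event

        if leader:
            try:
                value = self._load(key)
                with self._lock:
                    self._values[key] = value
                return value
            finally:
                # Wake followers whether the load succeeded or raised.
                with self._lock:
                    self._inflight.pop(key, None)
                event.set()

        # Follower: wait for the leader instead of issuing a duplicate read.
        event.wait()
        with self._lock:
            if key in self._values:
                return self._values[key]
        # The leader's load failed; retry (one retrying caller becomes leader).
        return self.get(key)
```

With this in front of the database, a burst of requests for the same freshly invalidated item key costs one backend read instead of one read per request. It does not address the root cause (the bulk job rewriting every key at peak traffic), which still calls for a prevention step such as scheduling guards, throttled uploads, or writing new values through to the cache rather than invalidating.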