Code Room
On-callMedium
Question
Your ranking service reads ~40 features per request from an online feature store (Redis-backed). At 15:10 the service's error rate climbs to 3% with timeouts, and p99 latency on the feature-fetch step spikes. Dashboards: the feature store's Redis node shows high CPU and a jump in slow-log entries; the keyspace grew sharply after a backfill job started writing a new large feature ('user_event_history') as a big serialized blob per user; client-side connection-pool wait time is up. A backfill of that feature began at 15:00. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.