On-callMediumoc-g571

Subject Feature pipeline store timeoutLevel Mid–Senior~30 minCommon in ML systems interviewsIndustries Technology

Question

Your ranking service reads ~40 features per request from an online feature store (Redis-backed). At 15:10 the service's error rate climbs to 3% with timeouts, and p99 latency on the feature-fetch step spikes. Dashboards: the feature store's Redis node shows high CPU and a jump in slow-log entries; the keyspace grew sharply after a backfill job started writing a new large feature ('user_event_history') as a big serialized blob per user; client-side connection-pool wait time is up. A backfill of that feature began at 15:00. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.