Code Room
On-callMedium
Question
You run the search service (Python/FastAPI). At 16:40 the 5xx rate on GET /search climbs from 0.1% to 7% and holds steady — not growing, just elevated. Latency is normal. Your dashboard shows the errors are coming from exactly one of your three deployment regions, and within that region the error logs are all `KeyError: 'ranking_score'`. A new model artifact was pushed to the feature store at 16:35 and the ranking service in that region picked it up; the other two regions are on a slightly older artifact pull schedule. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.