Code Room
On-callMediumoc-g674
Subject Inference output contract breakLevel Mid–Senior~30 minCommon in ML systems · Reliability & on-call · Distributed systems interviewsIndustries Technology

Question

Your inference service scores transactions and returns a JSON response that a downstream payments service consumes synchronously to decide approve/decline. At 16:20 the payments team pages you: their decline rate jumped and they're seeing parse errors from your service, though your own dashboards look green — you return 200s at normal latency with no errors logged. Investigation: a deploy 20 minutes ago changed your response 'score' field from a float (0.0–1.0) to a nested object {value, version} to add model-version metadata; the downstream parser expects a top-level float, so it fails to read the score and falls back to a conservative 'decline'. Your contract/schema with them wasn't versioned and they weren't notified. How do you triage and respond?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Diagram & narrate the incident
Loading whiteboard…
Run or narrate your approach, then ask the coach.