Question
Your inference service scores transactions and returns a JSON response that a downstream payments service consumes synchronously to decide approve/decline. At 16:20 the payments team pages you: their decline rate jumped and they're seeing parse errors from your service, though your own dashboards look green — you return 200s at normal latency with no errors logged. Investigation: a deploy 20 minutes ago changed your response 'score' field from a float (0.0–1.0) to a nested object {value, version} to add model-version metadata; the downstream parser expects a top-level float, so it fails to read the score and falls back to a conservative 'decline'. Your contract/schema with them wasn't versioned and they weren't notified. How do you triage and respond?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
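As a rough illustration of the "prevent a repeat" step, the sketch below assumes the fix is an additive change: `score` stays a top-level float and the model version moves to a sibling field. The names `build_response`, `violates_contract`, and `model_version` are hypothetical, not taken from the actual service. A check like this could run in CI or as a pre-deploy canary so a breaking change to the response shape fails before the downstream parser sees it.

```python
import json
from typing import List


def build_response(score: float, model_version: str) -> dict:
    """Additive change: keep `score` as a top-level float and put the
    version metadata in a new sibling field (hypothetical field name)."""
    return {"score": float(score), "model_version": model_version}


def violates_contract(payload: str) -> List[str]:
    """Return contract violations for a serialized response.

    An empty list means the payload still matches what the downstream
    parser is described as expecting: a top-level numeric `score`
    in the range 0.0 to 1.0.
    """
    problems: List[str] = []
    body = json.loads(payload)
    if not isinstance(body, dict):
        return ["response body must be a JSON object"]

    score = body.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        problems.append(
            f"score must be a top-level number, got {type(score).__name__}"
        )
    elif not (0.0 <= score <= 1.0):
        problems.append(f"score out of range: {score}")
    return problems


if __name__ == "__main__":
    # Shape from the breaking deploy: nested object instead of a float.
    broken = json.dumps({"score": {"value": 0.42, "version": "v7"}})
    # Backward-compatible shape: float preserved, metadata added alongside.
    fixed = json.dumps(build_response(0.42, "v7"))
    print(violates_contract(broken))
    print(violates_contract(fixed))
```

The design choice here is that existing consumers keep working because nothing they read changes type; only new, optional fields appear. This does not replace rolling back the deploy as the immediate mitigation, and it does not replace agreeing on a versioned schema with the payments team.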