Code Room
On-callMedium
Question
Your automated canary for v88 of the payments service runs at 10:00: 1 canary pod + 1 baseline pod, 5% traffic each. The canary analysis goes green and auto-promotes to 100% at 10:20. By 10:35 you're paged: 2% of payment captures are silently failing with a downstream gateway returning HTTP 402, but only for one currency (JPY). The canary dashboards during the 20-minute window showed identical error rates between canary and baseline. The diff in v88 changed how amounts are formatted before sending to the gateway. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.