Code Room
On-call | Hard | oc-g275
Subject: Canary failure | Level: Senior–Staff | ~35 min | Common in: Reliability & on-call interviews | Industries: Technology, Software development

Question

A canary at 5% runs for 90 minutes and is healthy on every metric. It auto-promotes. Within 10 minutes of reaching 100%, a payment-related bug surfaces: international checkouts in three currencies fail with a rounding/precision error, affecting ~0.4% of total orders. Recent context: the load balancer splits canary traffic by source-IP hashing, and the canary pods happened to sit behind a region where almost all traffic is domestic and single-currency. The new code changed money handling from integer minor units to a floating-point intermediate. Dashboards: the overall error rate barely moved (0.4%); a currency-segmented order-success panel shows a sharp cliff in JPY, KWD, and BHD at the promote time. How do you triage and mitigate, and what's wrong with the canary?
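
To make the money-handling change concrete, here is a minimal Python sketch of the two approaches. It is an illustration under assumptions, not the code from the incident: the function names and the truncating float path are hypothetical, and only the minor-unit exponents (0 for JPY, 3 for KWD and BHD, 2 for USD) come from ISO 4217.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# ISO 4217 minor-unit exponents: JPY has no minor unit, KWD and BHD have three
# decimal places, USD has two.
MINOR_UNIT_EXPONENT = {"USD": 2, "JPY": 0, "KWD": 3, "BHD": 3}


def to_minor_units_exact(amount: str, currency: str) -> int:
    """Convert a decimal string to integer minor units without using float."""
    exponent = MINOR_UNIT_EXPONENT[currency]
    quantum = Decimal(1).scaleb(-exponent)  # 1, 0.01, or 0.001
    quantized = Decimal(amount).quantize(quantum, rounding=ROUND_HALF_EVEN)
    return int(quantized.scaleb(exponent))


def to_minor_units_via_float(amount: str, currency: str) -> int:
    """Hypothetical buggy path: binary float intermediate, then truncation."""
    return int(float(amount) * 10 ** MINOR_UNIT_EXPONENT[currency])
```

In this sketch, an amount such as "1.005" KWD has no exact binary representation, so the float path can land one minor unit below the exact path. The exact path keeps every amount as a `Decimal` until it becomes an integer. A canary that sees almost no three-decimal or zero-decimal currencies would have little chance of exercising the difference.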

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
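
One way to frame "what prevents a repeat" is a promotion gate that checks each required traffic segment, not just the aggregate. The sketch below is a hedged illustration: `SegmentStats`, `canary_gate`, and the `min_orders` and `max_drop` thresholds are all hypothetical names and values chosen for the example, not part of any real deployment tool.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class SegmentStats:
    orders: int
    successes: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.orders if self.orders else 0.0


def canary_gate(
    baseline: Dict[str, SegmentStats],
    canary: Dict[str, SegmentStats],
    required_segments: Iterable[str],
    min_orders: int = 200,   # hypothetical minimum canary sample per segment
    max_drop: float = 0.02,  # hypothetical tolerated drop in success rate
) -> Tuple[bool, List[str]]:
    """Return (ok_to_promote, reasons). Blocks on missing coverage or regression."""
    reasons: List[str] = []
    for segment in sorted(required_segments):
        stats = canary.get(segment, SegmentStats(0, 0))
        if stats.orders < min_orders:
            # The canary never saw enough of this segment to say anything about it.
            reasons.append(
                f"{segment}: insufficient canary coverage "
                f"({stats.orders} < {min_orders} orders)"
            )
            continue
        base = baseline.get(segment)
        if base is None or base.orders == 0:
            reasons.append(f"{segment}: no baseline to compare against")
            continue
        drop = base.success_rate - stats.success_rate
        if drop > max_drop:
            reasons.append(
                f"{segment}: success rate dropped by {drop:.1%} versus baseline"
            )
    return (not reasons, reasons)
```

Applied to the scenario in the question, a gate like this would likely have blocked on coverage for JPY, KWD, and BHD, because the canary's region carried almost no orders in those currencies. The same per-segment comparison, run against live traffic after promotion, is also the signal that exposes the cliff that the aggregate error rate hides.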
