Question
A canary at 5% runs for 90 minutes and is healthy on every metric. It auto-promotes. Within 10 minutes of reaching 100%, a payment-related bug surfaces: international checkouts in three currencies fail with a rounding/precision error, affecting ~0.4% of total orders. Recent context: the canary's traffic split is done at the load balancer by source-IP hashing, and the canary pods happened to serve a region where almost all traffic is domestic and single-currency. The new code changed money handling from integer minor units to a floating-point intermediate. Dashboards: the overall error rate barely moved (0.4%); a currency-segmented order-success panel shows a sharp cliff in JPY, KWD, and BHD at promote time. How do you triage and mitigate, and what's wrong with the canary?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
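For the "what prevents a repeat" part, one option is a promotion gate that checks whether the canary actually saw the traffic segments production serves. The sketch below is a minimal illustration, not a definitive implementation: the function names, the `min_canary_samples` threshold, and the order counts are all hypothetical, and it assumes per-currency order counts for the canary window are available from the metrics system.

```python
from typing import Dict, List


def uncovered_segments(
    baseline_counts: Dict[str, int],
    canary_counts: Dict[str, int],
    min_canary_samples: int = 50,  # hypothetical threshold, tune per service
) -> List[str]:
    """Return segments that production serves but the canary barely saw."""
    return sorted(
        segment
        for segment, count in baseline_counts.items()
        if count > 0 and canary_counts.get(segment, 0) < min_canary_samples
    )


def may_auto_promote(
    baseline_counts: Dict[str, int],
    canary_counts: Dict[str, int],
    min_canary_samples: int = 50,
) -> bool:
    """Allow auto-promotion only when every live segment is adequately sampled."""
    return not uncovered_segments(baseline_counts, canary_counts, min_canary_samples)


# Made-up order counts per currency over the canary window (illustration only).
baseline = {"USD": 40000, "JPY": 900, "KWD": 120, "BHD": 80}
canary = {"USD": 2100, "JPY": 3}

print(uncovered_segments(baseline, canary))  # ['BHD', 'JPY', 'KWD']
print(may_auto_promote(baseline, canary))    # False
```

A gate like this only detects the coverage gap. Closing it would also need a traffic split that samples every segment, for example hashing on a user or request identifier rather than source IP, so that a "healthy" canary is healthy for international checkouts too.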