Question
During a rolling deploy of the `payments` API (24 pods, ~10 min), reconciliation later flags a small cluster of DOUBLE charges, all timestamped within the rollout window. Context: v51 (new) changed the idempotency-key derivation to include a newly added `attempt_id` field; v50 (old) derives the key the old way (without it). The idempotency store is shared. During the rollout, a client's first attempt can hit a v50 pod and its retry a v51 pod (or vice versa): the two versions compute DIFFERENT idempotency keys for the SAME logical payment, so the dedup check misses and the charge runs twice. Dashboards: no error spike, p99 normal, charge volume slightly elevated. Triage, explain why only the rollout window is affected, then mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
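
To make the mechanism concrete, here is a minimal Python sketch of the key mismatch, not the service's actual code. The field names other than `attempt_id` (`merchant_id`, `order_id`, `amount_cents`), the SHA-256 derivation, the `Pod` class, and the `compat` flag are all hypothetical. It also assumes `attempt_id` stays the same across retries of one logical payment. The in-memory `set` stands in for the shared idempotency store.

```python
import hashlib


def key_v50(req: dict) -> str:
    """Hypothetical old derivation: no attempt_id."""
    raw = f"{req['merchant_id']}|{req['order_id']}|{req['amount_cents']}"
    return hashlib.sha256(raw.encode()).hexdigest()


def key_v51(req: dict) -> str:
    """Hypothetical new derivation: same fields plus attempt_id."""
    raw = (f"{req['merchant_id']}|{req['order_id']}|"
           f"{req['amount_cents']}|{req['attempt_id']}")
    return hashlib.sha256(raw.encode()).hexdigest()


class Pod:
    """Toy pod: dedups against a shared store, then records a charge."""

    def __init__(self, version: str, store: set, ledger: list,
                 compat: bool = False):
        self.version = version
        self.store = store      # shared idempotency store (assumed)
        self.ledger = ledger    # charges that actually executed
        self.compat = compat    # hypothetical flag; only affects v51 pods

    def _keys(self, req: dict) -> list:
        if self.version == "v50":
            return [key_v50(req)]
        if self.compat:
            # Dual-read/dual-write: v51 also checks and records the old key.
            return [key_v51(req), key_v50(req)]
        return [key_v51(req)]

    def charge(self, req: dict) -> str:
        keys = self._keys(req)
        # NOTE: check-then-set is not atomic here; a real store would need
        # an atomic set-if-absent operation.
        if any(k in self.store for k in keys):
            return "deduped"
        self.ledger.append(req["order_id"])
        self.store.update(keys)
        return "charged"


def simulate(first: str, retry: str, compat: bool = False) -> int:
    """Send one payment, then its retry; return how many charges ran."""
    store, ledger = set(), []
    req = {"merchant_id": "m1", "order_id": "o42",
           "amount_cents": 1999, "attempt_id": "a1"}
    Pod(first, store, ledger, compat).charge(req)
    Pod(retry, store, ledger, compat).charge(req)
    return len(ledger)


if __name__ == "__main__":
    for first, retry in [("v50", "v50"), ("v51", "v51"),
                         ("v50", "v51"), ("v51", "v50")]:
        print(first, retry,
              "plain:", simulate(first, retry),
              "compat:", simulate(first, retry, compat=True))
```

Under these assumptions, same-version pairs charge once, which is why traffic before and after the rollout is unaffected; only mixed-version pairs charge twice. The `compat` path illustrates one possible prevention: if the new version checks and writes both key derivations while old pods are still serving, both retry directions dedup without any change to v50. It is not a substitute for pausing or rolling back the deploy as the immediate mitigation.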