Question
A global config change (raising a per-request memory budget for an image-transform edge worker) is rolled out region by region via a config-management pipeline. Three days in, it's applied to 14 of 19 regions. A customer reports that the same image URL returns a correctly-resized image from some POPs and a 500 from others, seemingly at random by geography. Dashboards: error rate is flat globally (the 500s are <0.05%), but per-region error rate shows a clean split — the 5 not-yet-updated regions have a small steady 500 rate on large images; the 14 updated regions are clean. The config pipeline dashboard shows the rollout as 'in progress, no failures.' Triage and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.