On-call · Hard · oc-g465

Subject: Config change
Level: Senior–Staff
~40 min
Common in: Reliability & on-call interviews
Industries: Technology

Question

A global config change (raising a per-request memory budget for an image-transform edge worker) is rolled out region by region via a config-management pipeline. Three days in, it's applied to 14 of 19 regions. A customer reports that the same image URL returns a correctly-resized image from some POPs and a 500 from others, seemingly at random by geography. Dashboards: error rate is flat globally (the 500s are <0.05%), but per-region error rate shows a clean split — the 5 not-yet-updated regions have a small steady 500 rate on large images; the 14 updated regions are clean. The config pipeline dashboard shows the rollout as 'in progress, no failures.' Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
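To illustrate "form hypotheses from real signals", here is a minimal Python sketch of the breakdown the scenario describes: aggregating 5xx rate by rollout status instead of globally. The `RegionStats` type, the `split_by_rollout` helper, and all traffic numbers are hypothetical, chosen only so the global rate stays under the 0.05% quoted in the question. It is not the output of any real dashboard or pipeline.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    region: str       # hypothetical region label
    updated: bool     # True if the new memory-budget config is applied
    requests: int     # total requests in the window
    errors_5xx: int   # 500-class responses in the window


def error_rate(requests: int, errors: int) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    return errors / requests if requests else 0.0


def split_by_rollout(stats: list[RegionStats]) -> dict[str, float]:
    """Compare error rate in updated vs. not-yet-updated regions."""
    totals = {True: [0, 0], False: [0, 0]}  # updated -> [requests, errors]
    for s in stats:
        totals[s.updated][0] += s.requests
        totals[s.updated][1] += s.errors_5xx
    all_requests = totals[True][0] + totals[False][0]
    all_errors = totals[True][1] + totals[False][1]
    return {
        "updated": error_rate(*totals[True]),
        "not_updated": error_rate(*totals[False]),
        "global": error_rate(all_requests, all_errors),
    }


# Made-up numbers: 14 clean updated regions, 5 old-config regions
# with a small steady 500 rate.
sample = [RegionStats(f"updated-{i}", True, 1_000_000, 0) for i in range(14)]
sample += [RegionStats(f"pending-{i}", False, 1_000_000, 1_500) for i in range(5)]

rates = split_by_rollout(sample)
for name, rate in rates.items():
    print(f"{name}: {rate:.4%}")
```

With these assumed numbers the global rate stays below 0.05% while the not-yet-updated group sits at 0.15%, which is why a global error-rate alert would stay quiet. A split this clean ties the failures to the old config rather than the new one, which suggests that pausing or rolling back the rollout would likely widen the impact rather than mitigate it.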
