Question
A scheduling service deploys a refactor of its time handling on a Friday; no issues all weekend. Monday at 09:00 local, support is flooded: recurring meetings and reminders are firing one hour off for a large subset of users — but only users in regions that just had a daylight-saving transition the prior weekend; users in non-DST regions are fine. Dashboards: no error-rate change, no latency change; a business metric for 'reminder accuracy' (derived from user-reported mismatches) is the only red signal. The refactor switched from storing event times with timezone-aware instants to storing a naive local datetime plus a fixed UTC offset captured at creation time. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.