On-callMediumoc-g281

Subject Bad rolloutLevel Mid–Senior~30 minCommon in Reliability & on-call interviewsIndustries Software development

Question

A scheduling service deploys a refactor of its time handling on a Friday; no issues all weekend. Monday at 09:00 local, support is flooded: recurring meetings and reminders are firing one hour off for a large subset of users — but only users in regions that just had a daylight-saving transition the prior weekend; users in non-DST regions are fine. Dashboards: no error-rate change, no latency change; a business metric for 'reminder accuracy' (derived from user-reported mismatches) is the only red signal. The refactor switched from storing event times with timezone-aware instants to storing a naive local datetime plus a fixed UTC offset captured at creation time. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.