On-call · Hard · oc-g132

Subject: Data incidents
Level: Senior–Staff
Time: ~40 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

At 00:00 UTC on the night the clocks 'fell back' (DST ended in a region), a session-and-rate-limit store starts behaving wrongly: some users are logged out an hour early, some abusers' rate limits never reset, and a TTL-based dedup window briefly lets duplicate events through.

Dashboards show the following: one app fleet's hosts log timestamps an hour off from the others; the store keys are computed from a LOCAL-time wall clock rather than a monotonic/UTC source; the dedup window keys events by a truncated local-time bucket; a config sets `TZ` per-host, and three hosts were rebuilt last week WITHOUT the timezone package, so they fall back to UTC while the rest use local time.

How do you triage this time-related incident, stop the wrong expiries and duplicate leakage, and reconcile affected state?
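To make the failure mode concrete, here is a minimal sketch of why a truncated local-time bucket is fragile. Everything specific in it is an assumption for illustration: the transition instant, the one-hour offset, the 10-minute bucket size, and the helper names `wall_clock`, `local_bucket`, and `utc_bucket` are hypothetical, since the question names neither the region nor the store's key format.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical transition: the region moves from UTC+1 back to UTC+0 here.
FALL_BACK_UTC = datetime(2024, 10, 27, 1, 0, tzinfo=timezone.utc)
SUMMER_OFFSET = timedelta(hours=1)
WINTER_OFFSET = timedelta(0)


def wall_clock(instant_utc: datetime, has_tzdata: bool) -> datetime:
    """Naive wall-clock time as a host would read it.

    A host without the timezone package is modeled as silently using UTC.
    """
    if not has_tzdata:
        offset = timedelta(0)
    elif instant_utc < FALL_BACK_UTC:
        offset = SUMMER_OFFSET
    else:
        offset = WINTER_OFFSET
    return (instant_utc + offset).replace(tzinfo=None)


def local_bucket(instant_utc: datetime, has_tzdata: bool, minutes: int = 10) -> str:
    """Dedup key from truncated local time (the pattern described in the question)."""
    wall = wall_clock(instant_utc, has_tzdata)
    floored = wall.replace(
        minute=wall.minute - wall.minute % minutes, second=0, microsecond=0
    )
    return floored.strftime("%Y-%m-%dT%H:%M")


def utc_bucket(instant_utc: datetime, minutes: int = 10) -> int:
    """Dedup key from the UTC epoch; independent of any host's TZ setting."""
    return int(instant_utc.timestamp()) // (minutes * 60)


event_time = datetime(2024, 10, 27, 0, 30, tzinfo=timezone.utc)
hour_later = event_time + timedelta(hours=1)

# Same event, two hosts: the keys differ, so the duplicate is not detected.
hosts_disagree = local_bucket(event_time, True) != local_bucket(event_time, False)

# One host, two instants an hour apart: the local key repeats after fall-back.
repeated_hour = local_bucket(event_time, True) == local_bucket(hour_later, True)

# The epoch-based key stays distinct across the repeated hour.
utc_keys_distinct = utc_bucket(event_time) != utc_bucket(hour_later)
```

Under these assumptions, the mixed fleet explains the duplicate leakage (two hosts derive different keys for the same event), while the repeated local hour explains keys that collide or expire at the wrong time. A key derived from the UTC epoch avoids both.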

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
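One way to turn "real signals" into a hypothesis quickly is to compare each host's UTC offset against the fleet majority. The sketch below assumes the offsets have already been collected somehow (for example by running `date +%z` on each host and converting to seconds); the host names, the values, and the function name `find_offset_outliers` are hypothetical.

```python
from collections import Counter


def find_offset_outliers(host_offsets: dict[str, int]) -> list[str]:
    """Return hosts whose UTC offset (in seconds) differs from the fleet majority."""
    majority_offset, _count = Counter(host_offsets.values()).most_common(1)[0]
    return sorted(
        host for host, offset in host_offsets.items() if offset != majority_offset
    )


# Hypothetical snapshot: three rebuilt hosts report UTC, the rest report UTC+1.
reported_offsets = {
    "app-01": 3600,
    "app-02": 3600,
    "app-03": 3600,
    "app-04": 3600,
    "app-05": 0,
    "app-06": 0,
    "app-07": 0,
}

outliers = find_offset_outliers(reported_offsets)
print(outliers)
```

The outlier list gives a candidate set of hosts to drain or pin while the key derivation is fixed. It identifies disagreement only; which side is "right" still depends on what clock the store keys should have used in the first place.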
