On-callMediumoc-g483

Subject Config changeLevel Mid–Senior~35 minCommon in Reliability & on-call interviewsIndustries IT services, Technology

Question

In an enterprise ITSM platform, an admin updates a config that sets the default business-hours calendar used for SLA clocks (e.g., changing the org default from 'UTC 9-5' to 'follow-the-sun' regional). The change is applied at 16:00 and propagates to the SLA engine. Within an hour, the support team reports that a batch of in-flight tickets suddenly flipped from 'on track' to 'breached,' and some auto-escalations fired and paged on-call managers needlessly. Dashboards: no service errors; the SLA-engine recompute job ran at config-apply time and re-derived remaining-time for OPEN tickets against the new calendar; a 'breach count' metric jumped at 16:00 then partially settled. Triage, explain the mechanism, then mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.