On-call | Hard | oc-g475

Subject: Deploy incidents | Level: Senior–Staff | ~40 min | Common in: Reliability & on-call interviews | Industries: Technology, Software development

Question

A change ships in two separate artifacts: a code deploy (a new `tax` service that reads a new `rate_table_v2` config) and a config deploy (publishing `rate_table_v2`). The two are deployed by different pipelines, and the team assumed config would land first. Today the CODE pipeline finished at 09:58 and the CONFIG pipeline at 10:07. Between 09:58 and 10:07, the new code looked for `rate_table_v2`, didn't find it, and fell back to an empty table, so for 9 minutes ~30% of checkouts computed $0 tax (no errors; every response was a 200). Dashboards: error rate flat, latency flat; the `tax_collected` business metric dipped sharply from 09:58 to 10:07, then recovered. Triage the incident, explain the 9-minute window, then propose how to prevent a repeat.
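The failure mode in the question can be reproduced in a few lines. This is only a sketch: the in-memory `store`, the function names, the `"CA"` region, and the 7.25% rate are all hypothetical stand-ins, and only `rate_table_v2` and the two timestamps come from the scenario.

```python
from datetime import datetime


def lookup_rates_with_silent_fallback(store: dict, key: str) -> dict:
    # Mirrors the incident: a missing key quietly becomes an empty table.
    return store.get(key, {})


def compute_tax(amount_cents: int, region: str, rates: dict) -> int:
    # An unknown region maps to rate 0.0, so an empty table yields zero tax
    # without raising, which is why error rate and latency stayed flat.
    return round(amount_cents * rates.get(region, 0.0))


# The date is arbitrary; only the times of day come from the scenario.
code_live = datetime(2024, 1, 1, 9, 58)
config_live = datetime(2024, 1, 1, 10, 7)
window_minutes = int((config_live - code_live).total_seconds() // 60)

store_during_window = {}                                  # config not yet published
store_after_config = {"rate_table_v2": {"CA": 0.0725}}    # made-up rate

tax_during = compute_tax(
    10_000, "CA",
    lookup_rates_with_silent_fallback(store_during_window, "rate_table_v2"),
)
tax_after = compute_tax(
    10_000, "CA",
    lookup_rates_with_silent_fallback(store_after_config, "rate_table_v2"),
)
print(window_minutes, tax_during, tax_after)
```

The point of the sketch is that the same request succeeds in both states; only the business result differs, so the gap between the two pipeline finish times is the whole exposure window.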

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
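One way to make "what prevents a repeat" concrete is sketched below, under stated assumptions: a dict-like config `store`, a hypothetical `MissingConfigError`, and a hypothetical 20% alert tolerance, none of which appear in the scenario. It shows three ideas: fail closed instead of falling back to an empty table, keep new code out of rotation until its config exists, and alert on a business ratio rather than on errors alone.

```python
class MissingConfigError(RuntimeError):
    """Raised when a required config table is absent or empty."""


def load_required_table(store: dict, key: str) -> dict:
    # Fail closed: refuse to compute tax rather than silently use an empty table.
    table = store.get(key)
    if not table:
        raise MissingConfigError(f"required config {key!r} is missing or empty")
    return table


def is_ready(store: dict, required_keys: list[str]) -> bool:
    # Readiness gate: new code takes no traffic until all of its configs exist,
    # which removes the dependency on pipeline ordering.
    return all(store.get(key) for key in required_keys)


def tax_ratio_dropped(tax_collected: float, checkout_total: float,
                      baseline_ratio: float, tolerance: float = 0.2) -> bool:
    # Compare tax per unit of checkout volume against a baseline, so a traffic
    # lull does not look like an incident but a silent $0-tax bug does.
    if checkout_total <= 0:
        return False
    return tax_collected / checkout_total < baseline_ratio * (1 - tolerance)
```

With roughly 30% of checkouts computing $0 tax, the ratio falls to about 70% of baseline, which a 20% tolerance would catch; the right threshold in practice depends on how noisy the metric normally is.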

