Code Room
On-callHardoc-g463
Subject Schema migrationLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

An expand-contract migration on the `subscriptions` service is mid-flight. Step 1 (last week): add a nullable `billing_cycle_anchor` column, backfill it, and have the NEW code read it with a COALESCE fallback to the old `created_at`. Step 2 ran in today's 10:00 deploy: drop the COALESCE and read `billing_cycle_anchor` directly, assuming the backfill is complete. At 10:06, ~0.5% of invoice generations crash with `NoneType has no attribute 'day'`, all for accounts created in a narrow window. Dashboards: error rate stepped up exactly at 10:06; DB is healthy; the backfill job's dashboard shows 100% complete and green. The crashing accounts were all created BETWEEN the start of the backfill and the COALESCE-removal deploy. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Diagram & narrate the incident
Loading whiteboard…
Run or narrate your approach, then ask the coach.