Code Room
On-callHard
Question
A migration job kicks off at 02:30 to add a NOT NULL column with a default to the 'orders' table (Postgres 11, ~400M rows). At 02:31 writes to orders start timing out, the API returns 503s for anything touching orders, and pg_stat_activity shows a long-running ALTER TABLE holding an AccessExclusiveLock with a growing queue of blocked queries behind it. Replica lag begins climbing. The migration was tested on a staging copy with 1M rows and ran in 2 seconds. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.