Question
Every deploy runs migrations as a gated pre-step. Today's 15:00 deploy hangs in the migration step: the migration is a quick `ADD COLUMN ... DEFAULT now()` that should take a second, but the deploy pipeline is stuck and times out after 10 minutes, blocking the release of an urgent hotfix. Dashboards: the app is otherwise healthy (the migration's ALTER hasn't acquired its lock so it isn't blocking live queries yet — it's waiting), `pg_stat_activity` shows the ALTER in `lock not available`/waiting, and the migration tool is configured with a `lock_timeout` of 0 (infinite) so it just waits. Ahead of it in the lock queue: a long-running analytics transaction that's been open for 40 minutes. New queries are starting to queue behind the ALTER. Triage and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.