Question
A deploy's migration step runs `ALTER TABLE events ADD COLUMN processed_at TIMESTAMPTZ` (Postgres). The ALTER hangs. Two minutes later, the whole `events` table is effectively frozen: every read and write queues, p99 goes to seconds, the connection pool fills, and the app starts shedding requests. Dashboards: `pg_stat_activity` shows the ALTER waiting on a lock (`wait_event_type = 'Lock'`), behind it a long pile-up of queries that are also waiting, and at the head of the chain a single old query that holds the conflicting lock: a reporting `SELECT ... FROM events` from an analytics job that has been running for 18 minutes. The migration tool has no `lock_timeout` set. How do you triage and mitigate, and how do you prevent it?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
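
To make the root-cause-versus-symptom split concrete, here is a minimal Python sketch that takes a snapshot of who is blocked by whom and picks out the session at the head of the chain. It assumes the snapshot was collected beforehand, for example by calling `pg_blocking_pids(pid)` for each row of `pg_stat_activity`. The pids (100 for the reporting SELECT, 200 for the ALTER, 301 and 302 for queued app queries) and the function names `root_blockers` and `direct_victims` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def root_blockers(blocked_by):
    """Return the pids that block at least one session but wait on nobody."""
    waiting = {pid for pid, blockers in blocked_by.items() if blockers}
    blocking = {b for blockers in blocked_by.values() for b in blockers}
    return blocking - waiting


def direct_victims(blocked_by):
    """Map each blocking pid to the sorted list of pids waiting directly on it."""
    victims = defaultdict(list)
    for pid, blockers in blocked_by.items():
        for blocker in blockers:
            victims[blocker].append(pid)
    return {blocker: sorted(pids) for blocker, pids in victims.items()}


# Hypothetical snapshot: pid -> pids it is waiting on.
# 100 = reporting SELECT (running), 200 = ALTER TABLE (waiting on 100),
# 301 and 302 = app queries queued behind the ALTER.
snapshot = {100: [], 200: [100], 301: [200], 302: [200]}

print("root blockers:", root_blockers(snapshot))
print("direct victims:", direct_victims(snapshot))
```

In this snapshot the root blocker is the reporting SELECT, while the app queries wait directly on the queued ALTER. Cancelling either one with `pg_cancel_backend(pid)` is a candidate mitigation, and setting `lock_timeout` in the migration session is one common way to keep a waiting ALTER from holding up the queue.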