Question
A routine migration to add a column ran during a low-traffic window and hung. Within a minute, the entire `users` table became effectively unavailable: every read and write against it is timing out, the app error rate has spiked to 100% on the affected endpoints, and the migration's `ALTER TABLE` is still `active`. `pg_locks` shows the ALTER waiting on an `AccessExclusiveLock`, with a long queue of normally fast SELECTs and UPDATEs waiting behind it. At the front of the queue is one old transaction, open for 40 minutes, that holds an `AccessShareLock` on `users`. Triage and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.