How a simple "Add Column" script can take down production for hours.
When you need to add a new column to a table, you run a Data Definition Language (DDL) migration like ALTER TABLE. To ensure the table structure isn't corrupted, the Database requires a Heavy Lock (Access Exclusive). No one can read or write to the table while the lock is held. If you add a column with a DEFAULT value to a table with 50 million rows, the database might take 20 minutes to physically rewrite all 50 million rows. During those 20 minutes, every single user query is blocked, the connection pool fills up, and the entire website crashes.
PostgreSQL and MySQL process locks in a Queue. If a long-running SELECT query is currently running on the table, the ALTER TABLE migration has to wait in line. But here's the catch: the ALTER TABLE migration demands an Exclusive Lock, so every single query that arrives after it is blocked behind it! A migration waiting just 30 seconds for an old query to finish can cause a catastrophic pile-up.
-- The BAD way (Takes 20 minutes, blocks everything)
ALTER TABLE users ADD COLUMN status VARCHAR DEFAULT 'active';
-- The GOOD way (Instantly adds the column, zero downtime)
-- Step 1: Add the column with NO default (takes 1 millisecond)
ALTER TABLE users ADD COLUMN status VARCHAR;
-- Step 2: Set the default for FUTURE inserts
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';
-- Step 3: Update existing rows slowly in the background
UPDATE users SET status = 'active' WHERE id BETWEEN 1 AND 1000;
-- (Sleep, then do the next 1000...)
Note: Since Postgres 11, adding a default with a constant value is actually instant! But adding an index or a dynamic default still requires extreme care.
Zero-downtime migrations require much more developer effort. You often have to split a single change into 3 or 4 separate deployments (e.g. Add column -> Deploy code to write to both -> Backfill old data -> Deploy code to read from new -> Drop old column). It slows down feature delivery to guarantee uptime.
SET lock_timeout TO '2s';. If the migration can't grab the lock in 2 seconds, it instantly fails and aborts, saving your database from a traffic pile-up!