Build a copy of the table with the new shape on the side, keep it in sync while it fills, then swap the two in a blink.
Running a plain ALTER TABLE on a huge table can hold a lock for minutes or hours, and during that window every read and write to that table is stuck waiting. For a busy production table, that is an outage.
The online approach (the gh-ost / pt-online-schema-change pattern) never blocks the live table. It creates a shadow table with the new schema, copies existing rows over in small batches, captures live writes happening in the meantime so the shadow stays current, and then does a single instant rename to swap them. The original table serves traffic the entire time.
The tool builds the shadow table, attaches a way to capture every live write, then loops copying rows in primary-key ranges so each batch is small and quick. Once the shadow has caught up, a single atomic rename inside one statement makes the swap appear instant to clients.
-- 1. shadow table = same shape as original, then the change we want
CREATE TABLE users_shadow LIKE users;
ALTER TABLE users_shadow ADD COLUMN status VARCHAR(16) NOT NULL DEFAULT 'active';
-- 2. capture live writes so the shadow never drifts.
-- pt-osc uses triggers; gh-ost tails the binlog instead.
CREATE TRIGGER users_ins AFTER INSERT ON users
FOR EACH ROW REPLACE INTO users_shadow (id, name) VALUES (NEW.id, NEW.name);
-- (plus matching AFTER UPDATE / AFTER DELETE triggers)
-- 3. backfill existing rows in small PK ranges, throttling between batches
SET @lo = 0;
REPEAT
INSERT IGNORE INTO users_shadow (id, name)
SELECT id, name FROM users
WHERE id > @lo AND id <= @lo + 1000; -- one batch
SET @lo = @lo + 1000;
-- pause here if replica lag is rising
UNTIL @lo >= (SELECT MAX(id) FROM users) END REPEAT;
-- 4. atomic cut-over: both renames in one statement, sub-second
RENAME TABLE users TO users_old, users_shadow TO users;
DROP TABLE users_old; -- once you are confident
| Concern | Naive ALTER | Online migration |
|---|---|---|
| Downtime | Locked for the whole rebuild | None — only a sub-second rename |
| Wall-clock time | Faster overall | Slower, paced into batches |
| Disk overhead | Roughly one extra copy | Full second copy until swap |
| Rollback | Hard to interrupt mid-rebuild | Abort any time before the swap |
| Write capture | Not needed | Triggers (pt-osc) or binlog (gh-ost) |
INSERT IGNORE / REPLACE so the freshest version wins.Say you need to add a NOT NULL column to a 50M-row table on a busy service.
A naive ALTER TABLE rebuilds the whole table under a lock and takes about 40 minutes — 40 minutes where every write to that table errors or hangs. That is a real outage.
The online migration instead backfills in 1,000-row batches, throttling between them to keep replica lag in check, finishing in roughly 2 hours of background work with zero downtime. Along the way it captures the ~3,000 live writes that landed during the copy and applies them to the shadow, so it stays in sync. The final RENAME TABLE swap completes in under a second. Slower on the clock, but the table never stopped serving.
Why is the original table able to keep serving reads and writes for the whole migration?
Mid-migration a user updates a row that was already copied. What keeps the shadow correct?