Question
A schema migration adds a `currency` column (defaulting writes to a per-row value) across a 16-shard MySQL fleet for an `orders` table. The migration runner applies DDL shard-by-shard. At 03:20 it failed partway: shards 0–9 have the new column, shards 10–15 don't. The runner reported failure and paged, but the API kept serving. Now reads are inconsistent: orders on migrated shards return a currency, orders on un-migrated shards 500 on the new code path that expects the column, and a reporting job is summing amounts across currencies as if they were all USD. How do you triage, stabilize, and safely complete the migration?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.