On-call · Hard · oc-g113

Subject: Migration gone wrong
Level: Senior–Staff
~40 min
Common in: Databases & SQL · Concurrency · Reliability & on-call interviews
Industries: Technology

Question

A 'zero-downtime' column-type migration (`gh-ost`) on a 400M-row MySQL 8 `users` table was kicked off at peak traffic. Twelve minutes in, the API's p99 latency went from 80ms to 9s and the error rate hit 30%. The dashboards show MySQL `Threads_running` pinned at the connection-pool ceiling, `History list length` (undo log) climbing fast, replica lag on the read replicas rising past 200s, and the gh-ost ghost-table copy at 60%. The previous on-call engineer already tried killing the app pods, which didn't help. What's happening, how do you stabilize, and how do you finish or unwind the migration without losing rows?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
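One way to make "mitigate first, then investigate from real signals" concrete is to write the triage order down as code. The sketch below is illustrative only, not a model answer: the `Signals` fields mirror the dashboards in the question, while `triage`, the 30-second lag limit, and the signal-to-step mapping are hypothetical assumptions. The query reads `information_schema.innodb_trx`, a standard InnoDB view that lists open transactions.

```python
from dataclasses import dataclass
from typing import List

# Oldest open transactions first: long-lived ones are a common reason the
# undo log (History list length) cannot be purged.
LONG_TRX_SQL = (
    "SELECT trx_id, trx_started, trx_mysql_thread_id "
    "FROM information_schema.innodb_trx "
    "ORDER BY trx_started LIMIT 10"
)


@dataclass
class Signals:
    """Dashboard readings named in the question (field names are assumptions)."""
    threads_running: int
    pool_ceiling: int
    history_list_length_rising: bool
    replica_lag_s: float


def triage(s: Signals, lag_limit_s: float = 30.0) -> List[str]:
    """Return mitigation steps in order: stop the bleeding before diagnosing.

    The 30s lag limit is an assumed threshold, not a recommendation.
    """
    steps: List[str] = []
    saturated = s.threads_running >= s.pool_ceiling
    lagging = s.replica_lag_s > lag_limit_s
    if saturated or lagging:
        # Pause the extra write load before touching anything else.
        steps.append("throttle the gh-ost row copy")
    if s.history_list_length_rising:
        steps.append("inspect long-running transactions: " + LONG_TRX_SQL)
    if lagging:
        steps.append("route reads away from lagging replicas")
    # Communicating status is part of every pass, healthy or not.
    steps.append("post a status update")
    return steps


if __name__ == "__main__":
    incident = Signals(
        threads_running=200,
        pool_ceiling=200,
        history_list_length_rising=True,
        replica_lag_s=200.0,
    )
    for number, step in enumerate(triage(incident), start=1):
        print(number, step)
```

The ordering is the point of the sketch: load-shedding actions come before diagnostic queries, and a status update closes every pass, which matches the "mitigate, then hypothesize, communicate as you go" shape described above.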

