Code Room
On-callMedium
Question
A migration to add an index on a large 'events' table (MySQL 8, primary + 2 read replicas) runs at 03:00 using a standard online DDL. The primary stays healthy and the migration completes there in 6 minutes. But starting at 03:00, replica lag on both read replicas climbs to 40 minutes, and the app — which reads recent events from replicas — starts showing users stale/missing data and the analytics dashboards go blank. After the index finishes replicating, lag drains and everything recovers. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.