On-callHardoc-g066

Subject Schema migrationLevel Senior–Staff~40 minCommon in Code quality & review · Algorithms & data structures interviewsIndustries Technology, Software development

Question

A data migration kicks off at 01:00 to backfill a new 'normalized_email' column across 80M user rows. It's written as a one-shot script that loops over every row, and for each row it enqueues a downstream 'reindex user' job to the search-index queue. By 01:20 the search-index queue depth is 30M and climbing, the search workers are saturated, and live user search starts returning stale results and timing out. The primary DB looks fine. The migration has no rate limiting and no checkpointing. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.