Code Room
On-callHard
Question
A data migration kicks off at 01:00 to backfill a new 'normalized_email' column across 80M user rows. It's written as a one-shot script that loops over every row, and for each row it enqueues a downstream 'reindex user' job to the search-index queue. By 01:20 the search-index queue depth is 30M and climbing, the search workers are saturated, and live user search starts returning stale results and timing out. The primary DB looks fine. The migration has no rate limiting and no checkpointing. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.