On-call · Hard · oc-g114

Subject: Backfill storm · Level: Senior–Staff · ~40 min · Common in: Distributed systems interviews · Industries: Technology

Question

A one-off backfill to populate a new `risk_score` attribute on every account was launched as a fan-out of Lambda workers reading and writing a provisioned-capacity DynamoDB table that also serves live traffic. Within minutes, the live API 5xx rate jumps to 25%, DynamoDB `ThrottledRequests` and `UserErrors` spike, `ConsumedWriteCapacityUnits` is pinned at the provisioned ceiling, and the backfill's own DLQ is filling because workers retry-storm on throttles. Latency-sensitive live reads are timing out. The backfill is ~8% complete and 6,000 Lambda invocations are running concurrently. How do you triage, protect live traffic, and complete the backfill safely?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
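
One way to make "complete the backfill safely" concrete is to pace the backfill's writes below the table's spare capacity and to replace immediate retries with jittered backoff. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation: `TokenBucket`, `backoff_with_full_jitter`, and `backfill_budget` are hypothetical helper names, the 50% safety fraction is an arbitrary starting point, and the capacity numbers in the usage lines are made up. It uses only the Python standard library and does not call any AWS API.

```python
import random


class TokenBucket:
    """Token bucket that caps the backfill's write rate (capacity units per second)."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = float(rate_per_sec)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start full
        self.last = float(now)

    def try_acquire(self, cost, now):
        """Refill for the elapsed time, then spend `cost` tokens if enough are available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_with_full_jitter(attempt, base=0.05, cap=5.0, rng=random):
    """Delay in seconds before retry `attempt`: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def backfill_budget(provisioned_wcu, live_peak_wcu, safety_fraction=0.5):
    """Write units per second the backfill may use: a fraction of the headroom above live peak."""
    headroom = max(0.0, provisioned_wcu - live_peak_wcu)
    return headroom * safety_fraction


# Hypothetical numbers: 10,000 provisioned WCU, live traffic peaking at 6,000 WCU.
budget = backfill_budget(provisioned_wcu=10_000, live_peak_wcu=6_000)
bucket = TokenBucket(rate_per_sec=budget, capacity=budget)
```

A worker would call `bucket.try_acquire(cost, now)` before each write and, on a throttle, sleep for `backoff_with_full_jitter(attempt)` instead of retrying immediately. A per-process bucket only bounds the aggregate rate if the number of workers is itself capped, so in this scenario it would be paired with a hard limit on Lambda concurrency.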
