Question
A one-off backfill to populate a new `risk_score` attribute on every account was launched as a fan-out of Lambda workers reading and writing a provisioned-capacity DynamoDB table that also serves live traffic. Within minutes: live API 5xx rate jumps to 25%, DynamoDB `ThrottledRequests` and `UserErrors` spike, `ConsumedWriteCapacityUnits` is pinned at the provisioned ceiling, and the backfill's own DLQ is filling because workers retry-storm on throttles. Live latency-sensitive reads are timing out. The backfill is at ~8% complete and 6,000 Lambdas are running concurrently. How do you triage, protect live traffic, and complete the backfill safely?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
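As an illustration of what "stop the bleeding" could mean on the backfill side, here is a minimal sketch of two client-side controls: a token bucket that caps the write rate, and jittered exponential backoff to damp the retry storm. The names `TokenBucket` and `backoff_delay`, and every rate, burst, and delay value, are hypothetical choices for this example and do not come from the scenario. The clock is passed in explicitly so the behavior is easy to check.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Caps the backfill's write rate so live traffic keeps headroom."""
    rate_per_sec: float  # assumed budget, e.g. a fraction of provisioned write capacity
    capacity: float      # maximum burst size, in write units
    tokens: float = field(init=False, default=0.0)
    last: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        # Start with a full bucket.
        self.tokens = self.capacity

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 5.0, rng=random) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))
```

A worker would call `try_acquire(time.monotonic())` before each write and sleep for `backoff_delay(attempt)` after a throttled request instead of retrying immediately. A single bucket only limits one worker, so the aggregate rate still depends on how many workers run concurrently.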