Question
After a bad config push corrupted on-disk caches across a fleet, your auto-remediation kicked in and 400 stateful storage nodes simultaneously began rebuilding by restoring their datasets from the central backup/object store. On-call is now paged for a second, worse incident: the shared backup/object store and its network egress are saturated; per-node restore throughput has collapsed, so each restore's ETA is now 9+ hours instead of 40 minutes; the backup store's request throttling (429s) is spiking; and healthy nodes serving live traffic are slowing because they share the same storage backend and network. The fix for the first incident is now causing a thundering-herd 'restore storm'. How do you triage and control a mass-restore storm?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
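
As one way to make "stop the bleeding" concrete, the sketch below shows three small building blocks that a restore controller might use: splitting the 400 nodes into capped waves, backing off with jitter when the store returns 429s, and adjusting the concurrency cap from the observed throttle ratio. This is a minimal illustration, not the remediation system's actual logic. All function names, the cap of 25, the 1% throttle threshold, and the backoff parameters are hypothetical and would need to be derived from the measured headroom of the backup store and network.

```python
import random


def plan_waves(nodes, max_concurrent):
    """Split nodes into sequential waves of at most max_concurrent restores."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be >= 1")
    return [nodes[i:i + max_concurrent]
            for i in range(0, len(nodes), max_concurrent)]


def backoff_with_jitter(attempt, base_s=1.0, cap_s=60.0, rng=random):
    """Full-jitter exponential backoff for retrying after a 429.

    Returns a delay drawn uniformly from [0, min(cap_s, base_s * 2**attempt)],
    so retrying nodes spread out instead of hitting the store in lockstep.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def next_concurrency(current, throttle_ratio, threshold=0.01, floor=1, ceiling=50):
    """AIMD-style cap adjustment (all thresholds here are assumed values).

    Halve the cap when the share of throttled (429) requests exceeds the
    threshold; otherwise raise it by one, staying within [floor, ceiling].
    """
    if throttle_ratio > threshold:
        return max(floor, current // 2)
    return min(ceiling, current + 1)


def estimate_total_minutes(num_nodes, max_concurrent, minutes_per_restore):
    """Rough fleet completion time if every wave takes minutes_per_restore."""
    waves = -(-num_nodes // max_concurrent)  # ceiling division
    return waves * minutes_per_restore


if __name__ == "__main__":
    fleet = [f"node-{i:03d}" for i in range(400)]  # hypothetical node names
    waves = plan_waves(fleet, max_concurrent=25)
    print(len(waves), "waves;", estimate_total_minutes(400, 25, 40), "minutes")
```

With 400 nodes and an assumed cap of 25, this gives 16 waves; if each wave really does return to the 40-minute restore time (an assumption to verify against live metrics), the fleet finishes in roughly 640 minutes. Capping concurrency does not necessarily make the whole fleet finish sooner. The point is that the store and network keep headroom for live traffic, nodes come back progressively rather than all stalling together, and the cap can be raised or lowered as the 429 rate changes.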