Question
A node in your distributed storage cluster (Ceph-like, 3x replication) goes hard-down at 09:10. The cluster automatically starts re-replicating that node's data onto the remaining nodes to restore 3 copies. Within minutes, client write latency degrades sharply, the backfill saturates network and disk on the surviving nodes, and a second node starts flapping under the load. That raises the risk of a cascading failure that could threaten durability. You can't bring the dead node back (hardware fault). How do you triage, stabilize the cluster, and avoid losing further redundancy or the whole cluster?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
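
One trade-off behind "stop the bleeding" can be put into rough numbers: throttling backfill relieves the surviving nodes, but it also lengthens the time the cluster runs with fewer than 3 copies. The sketch below is a back-of-the-envelope model only. The function name, the 40 TB data size, the 9 surviving nodes, and the per-node recovery rates are all hypothetical and are not taken from the scenario. It assumes recovery throughput scales linearly with node count, which real clusters only approximate.

```python
def backfill_hours(data_tb: float, surviving_nodes: int, per_node_mb_s: float) -> float:
    """Rough hours needed to re-replicate data_tb terabytes.

    Assumes each surviving node sustains per_node_mb_s of recovery
    throughput and that the aggregate scales linearly (a simplification).
    Uses decimal units: 1 TB = 1,000,000 MB.
    """
    if surviving_nodes <= 0 or per_node_mb_s <= 0:
        raise ValueError("need at least one node and a positive rate")
    total_mb = data_tb * 1_000_000
    aggregate_mb_s = surviving_nodes * per_node_mb_s
    return total_mb / aggregate_mb_s / 3600


if __name__ == "__main__":
    # Hypothetical inputs: 40 TB lived on the dead node, 9 nodes survive.
    for rate in (400, 100, 25):  # MB/s of recovery traffic allowed per node
        hours = backfill_hours(40, 9, rate)
        print(f"{rate:>4} MB/s per node -> about {hours:.1f} h at reduced redundancy")
```

Running it with your own cluster's figures gives a first estimate of how long the exposure window lasts at each throttle level, which is useful when communicating status and deciding how hard to throttle.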