On-call · Hard · oc-g589

Subject: Storage node down rebalance
Level: Senior–Staff · ~40 min
Common in: Databases & SQL · Storage & CDN · Distributed systems interviews
Industries: Technology

Question

A node in your distributed storage cluster (Ceph-like, 3x replication) goes hard-down at 09:10. The cluster automatically starts re-replicating that node's data onto the remaining nodes to restore 3 copies. Within minutes, client write latency degrades sharply, network and disk on the surviving nodes saturate from the backfill, and a second node is now flapping under the load — raising the specter of a cascading failure that could threaten durability. You can't bring the dead node back (hardware fault). How do you triage, stabilize the cluster, and avoid losing redundancy or the whole cluster?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
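One way to make the "stop the bleeding" step concrete is to put rough numbers on the trade-off: throttling backfill relieves client I/O and the flapping node, but lengthens the window in which data sits at 2 copies. The sketch below is a back-of-the-envelope model, not a tuning tool. The function name `backfill_hours`, the 40 TB data size, the 9 surviving nodes, and the per-node throughput caps are all hypothetical values chosen for illustration, and it assumes the lost data spreads evenly across survivors.

```python
def backfill_hours(data_tb: float, surviving_nodes: int, per_node_mb_s: float) -> float:
    """Rough time to re-create the lost replicas.

    Assumptions (all illustrative): the dead node's data is spread evenly
    across the surviving nodes, and each survivor ingests recovery traffic
    at a steady per_node_mb_s (the throttle you choose).
    """
    if data_tb < 0 or surviving_nodes <= 0 or per_node_mb_s <= 0:
        raise ValueError("sizes and rates must be positive")
    total_mb = data_tb * 1_000_000            # decimal TB -> MB
    aggregate_mb_s = surviving_nodes * per_node_mb_s
    return total_mb / aggregate_mb_s / 3600   # seconds -> hours


if __name__ == "__main__":
    # Hypothetical cluster: 40 TB lived on the dead node, 9 nodes survive.
    for cap_mb_s in (400, 100, 50):
        hours = backfill_hours(40, 9, cap_mb_s)
        print(f"cap {cap_mb_s:>3} MB/s per node -> ~{hours:.1f} h at reduced redundancy")
```

A strong answer states this trade-off out loud: pick a throttle low enough that client latency recovers and the second node stops flapping, then say how long the cluster will run degraded at that rate and what you will watch in the meantime.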
