On-call · Hard · oc-g652

Subject: Storage rebalance stuck
Level: Mid–Senior · ~35 min
Common in: Storage & CDN · Distributed systems interviews
Industries: Technology

Question

You added two nodes to a 10-node sharded storage cluster to relieve capacity pressure. Eighteen hours later, on-call is paged: the rebalance (data migration) is stuck at ~62% complete and has stopped progressing. Meanwhile, the three originally hottest nodes are at 91% disk, one of them is approaching full, and client write latency on their shards is degrading. The dashboards show that rebalance throughput dropped to ~0 bytes/s about 3 hours ago, that one moving shard is logging repeated transfer-retry/timeout errors, and that the new nodes are at only 18% disk, so they have barely received data. A specific large shard (1.2 TB) keeps failing mid-transfer, and the network between racks is fine. The cluster won't shed load off the hot nodes until the move completes. How do you triage a stuck rebalance after a node add?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
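The "mitigate first, then hypothesize from real signals" order can be sketched as two small checks: how long each hot node has before its disk fills, and whether a transfer's byte counter has actually stopped advancing. This is a minimal illustration, not any particular storage system's tooling. The names `NodeDisk`, `hours_until_full`, `mitigation_order`, and `transfer_is_stalled` are hypothetical, and the fill rates in the usage example are made-up numbers; only the 91% and 18% disk figures come from the scenario.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeDisk:
    name: str
    used_pct: float             # current disk usage, in percent
    growth_pct_per_hour: float  # observed fill rate (assumed to come from metrics)


def hours_until_full(node: NodeDisk, limit_pct: float = 100.0) -> Optional[float]:
    """Hours until the node reaches limit_pct, or None if it is not growing."""
    if node.growth_pct_per_hour <= 0:
        return None
    return max(0.0, (limit_pct - node.used_pct) / node.growth_pct_per_hour)


def mitigation_order(nodes: List[NodeDisk]) -> List[NodeDisk]:
    """Sort nodes by urgency: soonest to fill first, non-growing nodes last."""
    def urgency(node: NodeDisk) -> float:
        hours = hours_until_full(node)
        return float("inf") if hours is None else hours
    return sorted(nodes, key=urgency)


def transfer_is_stalled(bytes_done_samples: List[int]) -> bool:
    """True if the cumulative bytes-transferred counter did not advance
    between the first and last sample in the window."""
    if len(bytes_done_samples) < 2:
        return False  # not enough data to call it stalled
    return bytes_done_samples[-1] <= bytes_done_samples[0]


if __name__ == "__main__":
    # Fill rates below are illustrative assumptions, not values from the incident.
    nodes = [
        NodeDisk("hot-1", 91.0, 0.5),
        NodeDisk("hot-2", 91.0, 1.5),
        NodeDisk("hot-3", 91.0, 0.25),
        NodeDisk("new-1", 18.0, 0.0),
    ]
    for node in mitigation_order(nodes):
        print(node.name, hours_until_full(node))
```

The time-to-full estimate sets the mitigation deadline (for example, how soon writes must be throttled or redirected away from the fullest node), while the stall check turns "throughput looks like zero" into a signal that can be confirmed per shard before forming a root-cause hypothesis.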
