Question
Your clients use a client-side least-connections / least-loaded load balancer across a fleet of backend replicas, with automatic retries on failure. One backend replica develops a problem (a slow disk) and gets slow but not down — its requests pile up. Clients retry the slow requests; the least-loaded LB, seeing the slow node's in-flight requests as 'busy', is supposed to route AWAY from it — yet you observe the OPPOSITE in one failure mode: retries keep landing on the SAME degraded node and it gets hammered into total failure while healthy nodes sit underused. How do you triage, and what load-balancing subtlety is biting you?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.