On-callHardoc-g461

Subject Retry amplificationLevel Senior–Staff~40 minCommon in Networking & APIs interviewsIndustries Technology, Software development

Question

Your clients use a client-side least-connections / least-loaded load balancer across a fleet of backend replicas, with automatic retries on failure. One backend replica develops a problem (a slow disk) and gets slow but not down — its requests pile up. Clients retry the slow requests; the least-loaded LB, seeing the slow node's in-flight requests as 'busy', is supposed to route AWAY from it — yet you observe the OPPOSITE in one failure mode: retries keep landing on the SAME degraded node and it gets hammered into total failure while healthy nodes sit underused. How do you triage, and what load-balancing subtlety is biting you?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.