On-call | Hard | oc-g298

Subject: Upstream timeout
Level: Senior–Staff
Time: ~40 min
Common in: Distributed systems interviews
Industries: Technology

Question

A request flows gateway → A → B → C. Each hop has a 3s client timeout. At 19:00, C develops a mild tail: most calls are fast but p99.9 hits 2.8s. You'd expect this to be harmless — 2.8s is under every 3s timeout. Instead, the gateway's error rate climbs to 8% and you see wasted work: C completes requests successfully that the gateway has already given up on, and B and A both retry. Dashboards: C's success rate is ~100% (it eventually answers), but its load is climbing; the gateway sees timeouts. No deploy on C; it's just a tail. How do you triage and mitigate?
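The arithmetic behind the scenario can be sketched as below. This is an illustration under stated assumptions, not a model of the real system: the per-hop overhead figure (80 ms for network, queueing, and processing) and the retry counts are hypothetical, since the question gives neither. Only the 3s timeout and the 2.8s tail come from the scenario. The function names are invented for this sketch.

```python
TIMEOUT_MS = 3000  # from the scenario: every hop uses a 3s client timeout


def observed_latency_ms(c_latency_ms, overhead_ms):
    """Latency each caller observes, walking back up the chain from C.

    overhead_ms is an ASSUMED per-hop cost (network + queueing + processing);
    the scenario does not state it.
    """
    observed = {}
    latency = c_latency_ms
    for caller in ("B", "A", "gateway"):
        latency += overhead_ms  # each hop adds its own overhead on top
        observed[caller] = latency
    return observed


def timed_out(observed):
    """Which callers see a latency above their 3s client timeout."""
    return {caller: ms > TIMEOUT_MS for caller, ms in observed.items()}


def max_calls_to_c(attempts_at_a, attempts_at_b):
    """Worst-case calls reaching C per gateway request.

    Each attempt from A can trigger every attempt from B, so the counts
    multiply. Attempt counts are ASSUMED; the scenario only says A and B retry.
    """
    return attempts_at_a * attempts_at_b


tail = observed_latency_ms(c_latency_ms=2800, overhead_ms=80)
print(tail)
print(timed_out(tail))
print(max_calls_to_c(attempts_at_a=2, attempts_at_b=2))
```

Under these assumed numbers, B and A still see the call finish inside 3s, but the gateway does not. C therefore completes work that the gateway has already abandoned. If C's tail grows further under load, A and B start timing out too, and with two attempts each a single gateway request can reach C up to four times.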

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
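One prevention that often comes up for this failure pattern is deadline propagation: the gateway sets a single end-to-end budget, and each hop uses what is left of it instead of a fresh 3s. The sketch below is a minimal illustration of that idea, not a prescribed answer. The `Deadline` class, `child_timeout` helper, and the 50 ms reserve are hypothetical names and values chosen for the example.

```python
import time


class Deadline:
    """End-to-end budget set once at the edge and passed down the call chain."""

    def __init__(self, budget_s, now=time.monotonic):
        self._now = now  # injectable clock, so the logic can be tested
        self._expires_at = now() + budget_s

    def remaining(self):
        """Seconds left before the original caller gives up (never negative)."""
        return max(0.0, self._expires_at - self._now())

    def expired(self):
        return self.remaining() == 0.0


def child_timeout(deadline, reserve_s=0.05, cap_s=3.0):
    """Timeout to use for the next downstream call.

    Uses the remaining budget minus a small reserve for this hop's own work
    (ASSUMED value), capped at the existing per-hop limit.
    """
    return max(0.0, min(cap_s, deadline.remaining() - reserve_s))


def should_attempt(deadline, reserve_s=0.05):
    """Skip calls and retries that cannot finish before the caller gives up."""
    return child_timeout(deadline, reserve_s) > 0.0
```

With a shared deadline, a hop that receives a request with no budget left can drop it immediately, and a retry is only attempted when it could still be useful to the original caller. Both behaviours reduce the wasted work on C described in the question.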
