On-call · Hard
Question
A request flows gateway → A → B → C. Each hop has a 3s client timeout. At 19:00, C develops a mild tail: most calls are fast but p99.9 hits 2.8s. You'd expect this to be harmless — 2.8s is under every 3s timeout. Instead, the gateway's error rate climbs to 8% and you see wasted work: C completes requests successfully that the gateway has already given up on, and B and A both retry. Dashboards: C's success rate is ~100% (it eventually answers), but its load is climbing; the gateway sees timeouts. No deploy on C; it's just a tail. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
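For this scenario, one candidate for "what prevents a repeat" is deadline propagation. When every hop starts its own fresh 3s timer, a 2.8s call in C leaves roughly 200ms to cover network and queueing across three hops, and any retry inside the chain restarts a 3s clock that the gateway will not wait for. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation: `Deadline`, `downstream_timeout`, `max_calls_to_leaf`, and the `reserve_s` value are hypothetical names chosen here, not part of any particular library. The question does not give retry counts, so two attempts per hop is an assumption.

```python
import time
from dataclasses import dataclass


class DeadlineExceeded(Exception):
    """Raised when a hop has no budget left to do useful work."""


@dataclass(frozen=True)
class Deadline:
    # Absolute expiry on this process's monotonic clock, in seconds.
    # Across process boundaries, send the remaining seconds instead,
    # because monotonic clocks are not comparable between hosts.
    expires_at: float

    @classmethod
    def after(cls, seconds, now=None):
        now = time.monotonic() if now is None else now
        return cls(expires_at=now + seconds)

    def remaining(self, now=None):
        now = time.monotonic() if now is None else now
        return self.expires_at - now


def downstream_timeout(deadline, reserve_s, now=None):
    """Timeout for the next hop: what the caller has left, minus a reserve
    for this hop's own overhead. Raises instead of calling downstream when
    the caller has already given up, which avoids wasted work."""
    budget = deadline.remaining(now) - reserve_s
    if budget <= 0:
        raise DeadlineExceeded("caller's deadline is spent; skip the call")
    return budget


def max_calls_to_leaf(attempts_per_hop):
    """Worst-case number of calls reaching the last service when every
    retrying hop uses all of its attempts (attempts multiply per hop)."""
    total = 1
    for attempts in attempts_per_hop:
        total *= attempts
    return total
```

Under these assumptions, a gateway would create `Deadline.after(3.0)` once, and A and B would each call `downstream_timeout(...)` before calling the next hop, so that timeouts shrink toward C instead of all being 3s. `max_calls_to_leaf([2, 2])` gives 4, meaning that if A and B each make two attempts, a single gateway request can reach C up to four times. That is consistent with C's load climbing while its own success rate stays near 100%.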