On-call · Hard · oc-g085

Subject: Upstream timeout
Level: Senior–Staff · ~45 min
Common in: Networking & APIs · Distributed systems interviews
Industries: Technology

Question

A user request flows through gateway → service-A → service-B → service-C (all gRPC). At 21:00, service-C develops a latency tail (p99.9 ~4s). Soon the whole chain is melting down: the gateway times out, but downstream you observe service-B and service-C still busily processing requests whose callers have already given up. Dashboards: huge 'context deadline exceeded' counts; service-B/C CPU is high doing 'wasted' work; goroutine/thread counts climbing across the chain. Recent context: timeouts are configured per-hop as fixed values (each hop has its own 5s timeout) with no deadline propagation. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
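For the "what prevents a repeat" part, it can help to put numbers on why fixed per-hop timeouts leave downstream services working for callers that have already given up. The sketch below is a simplified model, not a gRPC implementation: the arrival times in `ARRIVALS` are hypothetical queueing delays chosen for illustration, and `hop_deadline` and `wasted_seconds` are made-up helper names. It compares a fresh 5s timer at every hop (the configuration in the question) with a propagated deadline, where each hop stops at the earlier of its own cap and the caller's absolute deadline.

```python
# Simplified model: how long may each hop keep working after the gateway
# has already timed out? All times are seconds since the request entered
# the gateway. Arrival times are hypothetical, for illustration only.

GATEWAY_DEADLINE_S = 5.0   # the gateway gives up at t = 5s
HOP_CAP_S = 5.0            # each hop's own fixed timeout (from the question)

# (hop name, time the request reaches that hop), assumed queueing delays
ARRIVALS = [("service-A", 0.5), ("service-B", 2.0), ("service-C", 4.0)]


def hop_deadline(arrival_s, caller_deadline_s, hop_cap_s, propagate):
    """Absolute time at which this hop stops working on the request."""
    local_deadline = arrival_s + hop_cap_s
    if not propagate:
        # Fresh timer per hop: the caller's deadline is ignored.
        return local_deadline
    # Propagated: never outlive the caller's absolute deadline.
    return min(local_deadline, caller_deadline_s)


def wasted_seconds(arrivals, gateway_deadline_s, hop_cap_s, propagate):
    """Upper bound, per hop, on work done after the gateway gave up."""
    wasted = {}
    caller_deadline = gateway_deadline_s
    for name, arrival in arrivals:
        if propagate and arrival >= caller_deadline:
            # Deadline already expired on arrival: reject without working.
            wasted[name] = 0.0
            continue
        stop = hop_deadline(arrival, caller_deadline, hop_cap_s, propagate)
        wasted[name] = max(0.0, stop - gateway_deadline_s)
        caller_deadline = stop  # the next hop inherits this hop's deadline
    return wasted


fixed = wasted_seconds(ARRIVALS, GATEWAY_DEADLINE_S, HOP_CAP_S, propagate=False)
propagated = wasted_seconds(ARRIVALS, GATEWAY_DEADLINE_S, HOP_CAP_S, propagate=True)
print("fixed per-hop timeouts:", fixed)
print("propagated deadline:   ", propagated)
```

Under these assumed arrival times, the fixed configuration lets service-C work for up to 4s after the gateway has timed out, while the propagated deadline bounds that at zero for every hop. In a real gRPC chain the equivalent change is usually to pass the incoming request's context or deadline into each outgoing call instead of creating a fresh timeout per hop; the exact mechanism depends on the language and framework in use.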
