Question
A user request flows through gateway → service-A → service-B → service-C (all gRPC). At 21:00, service-C develops a latency tail (p99.9 ~4s). Soon the whole chain is melting down: the gateway times out, but downstream you observe service-B and service-C still busily processing requests whose callers have already given up. Dashboards: huge 'context deadline exceeded' counts; service-B/C CPU is high doing 'wasted' work; goroutine/thread counts climbing across the chain. Recent context: timeouts are configured per-hop as fixed values (each hop has its own 5s timeout) with no deadline propagation. How do you triage and mitigate?
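Stop the bleeding first, then form hypotheses from real signals. To mitigate, cut the wasted work: shed or rate-limit load at the gateway, and have service-B and service-C drop requests whose callers have already given up, so CPU and goroutine/thread counts can recover. Then separate root cause from symptom. The trigger is service-C's latency tail (p99.9 ~4s). The chain-wide meltdown is an amplifier: each hop runs its own fixed 5s timeout with no deadline propagation, so downstream hops keep working after the gateway has timed out. Communicate status as you go, and close with what prevents a repeat: propagate the caller's deadline and cancellation through every hop instead of relying on independent per-hop timeouts.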