Code Room
On-call · Hard
Question
A request flows gateway → A → B → C (all gRPC). The gateway sets a 2s deadline. At 20:00, C develops a latency tail (p99.9 ~5s). The gateway times out at 2s and returns errors to users — expected. But you ALSO see C and B CPU climbing and their inbound request rate rising, even though the callers that triggered the work already gave up at 2s. Dashboards: B and C keep processing requests whose gateway deadline has long passed; retries from A add more load. How do you triage and mitigate?
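One way to reason about the symptom is to model the chain on a simulated clock and compare what happens when the gateway's deadline is carried to every hop versus when it is dropped along the way. The sketch below is a toy model, not gRPC code: `Hop`, `run_chain`, and the per-hop work times (0.1s for A and B, 5s for C) are assumptions made for illustration. Only the 2s deadline and the 5s tail come from the scenario.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    name: str
    work_seconds: float  # hypothetical time this hop spends on the request


def run_chain(hops, deadline_at, start_at=0.0, enforce_deadline=True):
    """Walk the chain on a simulated clock.

    Returns (names of hops that did any work, seconds of work done
    after the caller's deadline had already passed).
    """
    now = start_at
    worked = []
    wasted = 0.0
    for hop in hops:
        if enforce_deadline and now >= deadline_at:
            break  # the caller has given up: refuse to start new work
        worked.append(hop.name)
        end = now + hop.work_seconds
        if enforce_deadline:
            end = min(end, deadline_at)  # abandon in-flight work at the deadline
        # Count only the part of this hop's work that ran past the deadline.
        wasted += max(0.0, end - max(now, deadline_at))
        now = end
    return worked, wasted


# Assumed timings: A and B are fast, C is in its 5s latency tail.
CHAIN = [Hop("A", 0.1), Hop("B", 0.1), Hop("C", 5.0)]

if __name__ == "__main__":
    for enforce in (False, True):
        first = run_chain(CHAIN, deadline_at=2.0, enforce_deadline=enforce)
        # A retry issued at t=2.5s, after the original 2s deadline.
        retry = run_chain(CHAIN, deadline_at=2.0, start_at=2.5,
                          enforce_deadline=enforce)
        print(f"enforce={enforce} first={first} retry={retry}")
```

Under these assumed timings, the non-enforcing run spends roughly 3.2s in C after the gateway has already returned an error, and a late retry repeats the whole chain as pure waste. The enforcing run stops C at the deadline and rejects the late retry before any hop starts. That difference is the signal to look for on the dashboards: work completing for requests whose deadline has already expired.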
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
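For the "stop the bleeding" step, one plausible mitigation is to make A's retries conditional on two things: time left on the original deadline, and a retry budget that closes when failures pile up. The sketch below is a minimal illustration, loosely modeled on token-based retry throttling. `RetryBudget`, `should_retry`, and all parameter values are hypothetical names and numbers, not the API of any particular library.

```python
class RetryBudget:
    """Token bucket that throttles retries when the failure rate is high.

    Each failure costs one token and each success refunds `token_ratio`
    tokens. Retries are allowed only while more than half the tokens remain.
    """

    def __init__(self, max_tokens=10.0, token_ratio=0.1):
        self.max_tokens = float(max_tokens)
        self.token_ratio = float(token_ratio)
        self.tokens = self.max_tokens

    def record_success(self):
        self.tokens = min(self.max_tokens, self.tokens + self.token_ratio)

    def record_failure(self):
        self.tokens = max(0.0, self.tokens - 1.0)

    def allow_retry(self):
        return self.tokens > self.max_tokens / 2


def should_retry(budget, remaining_seconds, min_attempt_seconds=0.05):
    """Retry only if the caller is still waiting and the budget is open.

    `remaining_seconds` is the time left on the ORIGINAL deadline, not a
    fresh per-attempt timeout. `min_attempt_seconds` is an assumed floor
    below which another attempt cannot plausibly finish.
    """
    if remaining_seconds <= min_attempt_seconds:
        return False  # nobody will read the answer; do not add load
    return budget.allow_retry()


if __name__ == "__main__":
    budget = RetryBudget(max_tokens=4, token_ratio=0.5)
    for _ in range(2):
        budget.record_failure()
    print(should_retry(budget, remaining_seconds=1.0))
```

The design choice here is that retries are treated as a shared, limited resource instead of a per-request right. During a latency tail most attempts fail, so the budget drains quickly and A stops amplifying load on B and C, while the deadline check keeps A from retrying on behalf of a gateway that has already returned an error.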