Code Room
On-call · Hard
Question
A user-facing request passes through a chain of 4 services (gateway → A → B → C). At 17:10 you see massive load on service C and timeouts at the gateway. Dashboards show: each hop has its own 2s timeout and retries twice on timeout; C's actual p99 is 1.8s but it's receiving ~8x its normal request volume; user traffic is flat; the gateway's own deadline is 2s. A latency bump on C started at 17:05 (a slow query). How do you triage and mitigate?
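To put the ~8x figure in context, here is a rough sketch of how retries compound across the chain. It assumes (the prompt does not say this explicitly) that the gateway, A, and B each retry independently and that every retry re-triggers the full downstream path. The function name `worst_case_amplification` is hypothetical.

```python
def worst_case_amplification(retrying_callers: int, retries_per_caller: int) -> int:
    """Upper bound on requests reaching the last service per user request.

    Assumes each calling layer makes 1 original attempt plus
    `retries_per_caller` retries, and that retries multiply across layers.
    """
    attempts_per_caller = 1 + retries_per_caller
    return attempts_per_caller ** retrying_callers


# Three layers call downstream in gateway -> A -> B -> C: gateway, A, and B.
print(worst_case_amplification(3, 2))  # 27: ceiling with 2 retries per layer
print(worst_case_amplification(3, 1))  # 8: what 1 retry per layer would give
```

Under these assumptions, the observed ~8x sits well below the 27x ceiling, which suggests only some requests are timing out and being retried at each layer. With user traffic flat, the extra load on C is most plausibly retry traffic rather than new demand.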
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
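One way to reason about the "stop the bleeding" step is to check whether a retry can still be useful. The sketch below assumes the gateway's 2s deadline is propagated to every hop, which the scenario does not state; the helper names `remaining_budget` and `should_retry` are hypothetical, not from any specific library.

```python
def remaining_budget(deadline_s: float, now_s: float) -> float:
    """Seconds left before the caller's end-to-end deadline expires."""
    return max(0.0, deadline_s - now_s)


def should_retry(deadline_s: float, now_s: float, expected_latency_s: float,
                 attempts_made: int, max_attempts: int) -> bool:
    """Retry only if attempts remain and the call could finish in time."""
    if attempts_made >= max_attempts:
        return False
    return remaining_budget(deadline_s, now_s) >= expected_latency_s


# The gateway deadline is 2s and C's p99 is 1.8s (both from the scenario).
# After a hop waits out its own 2s timeout, no budget is left, so the
# retry is work that nobody is waiting for.
print(should_retry(2.0, 2.0, 1.8, attempts_made=1, max_attempts=3))  # False

# A failure detected early (at 0.1s) still leaves room for one more try.
print(should_retry(2.0, 0.1, 1.8, attempts_made=1, max_attempts=3))  # True
```

Because each hop's timeout equals the gateway's deadline, every retry issued after a timeout lands on C after the user has already been failed. Disabling or capping those retries reduces load on C without making the user experience worse, which makes it a reasonable first mitigation to consider before chasing the slow query.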