Code Room
On-call · Hard · oc-g214
Subject: Distributed failures
Level: Senior–Staff
~40 min
Common in: Distributed systems interviews
Industries: Technology, Software development

Question

A user-facing request travels through a chain of four services (gateway → A → B → C). At 17:10 you see massive load on service C and timeouts at the gateway. The dashboards show: each hop has its own 2s timeout and retries twice on timeout; C's actual p99 is 1.8s, but it is receiving ~8x its normal request volume; user traffic is flat; the gateway's own deadline is 2s. A latency bump on C (a slow query) started at 17:05. How do you triage and mitigate?
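The numbers in the question point toward retry amplification, and a quick calculation makes the ceiling concrete. The sketch below assumes that the gateway, A, and B each retry independently and that every attempt times out; the function name and parameters are illustrative, not part of the question.

```python
def worst_case_amplification(calling_layers: int, retries_per_layer: int) -> int:
    """Upper bound on requests reaching the last service per user request.

    Assumes every attempt at every calling layer times out and is retried,
    and that layers retry independently (no shared retry budget).
    """
    attempts_per_layer = 1 + retries_per_layer  # first try plus retries
    return attempts_per_layer ** calling_layers


# Assumption: gateway, A, and B each call downstream, so 3 calling layers.
print(worst_case_amplification(calling_layers=3, retries_per_layer=2))  # 27
print(worst_case_amplification(calling_layers=3, retries_per_layer=1))  # 8
```

Under these assumptions, two retries per layer allow up to 27x load on C, so the observed ~8x with flat user traffic sits well inside what retries alone could produce. It would also match exactly one retry per calling layer, which is worth confirming against the real retry configuration.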

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
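For the "what prevents a repeat" part, one option a candidate might raise is propagating a single end-to-end deadline instead of giving every hop its own fixed 2s timeout. The following is a minimal sketch of that idea, not a definitive implementation: `call_with_deadline`, `DeadlineExceeded`, and `min_budget_s` are hypothetical names, and `call` stands in for whatever client function accepts a timeout in seconds.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DeadlineExceeded(Exception):
    """Raised when the overall deadline leaves no time for another attempt."""


def call_with_deadline(
    call: Callable[[float], T],
    deadline: float,
    max_retries: int = 2,
    min_budget_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call `call(timeout)` with at most `max_retries` retries on TimeoutError.

    Each attempt gets only the time left before `deadline` (a value on the
    same scale as `clock()`), so no retry is sent once the caller upstream
    has already given up.
    """
    attempts = 0
    while True:
        budget = deadline - clock()
        if budget < min_budget_s:
            raise DeadlineExceeded(f"gave up after {attempts} attempt(s)")
        attempts += 1
        try:
            return call(budget)
        except TimeoutError:
            if attempts > max_retries:
                raise  # retries exhausted while time still remained
```

With a 2s overall deadline, a first attempt that consumes the full 2s leaves no budget, so the retry is skipped rather than added to C's load. In the scenario, retries fired after the gateway's 2s deadline cannot help the user and only add work downstream.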
