Code Room
On-callMedium
Question
A request flows A → B → C. C has intermittently slow calls (its p99 is 6s). A's client timeout to B is 2s with 2 retries; B's client timeout to C is 5s with 2 retries. When C is slow, you observe: A times out at 2s and retries, but B is still happily working on the first request (and will keep going up to 5s × retries), and the backend ends up doing far more work for C than the user ever waits for. C's load climbs even though user traffic is flat. Triage and explain the timeout/retry misconfiguration.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.