On-call | Medium | oc-g496

Subject: Upstream timeout
Level: Mid–Senior
~30 min
Common in: Distributed systems interviews
Industries: Technology, Software development

Question

Your API service calls an upstream profile service through a fixed-size HTTP connection pool (max 50 connections). At 17:00 the profile service had a brief 30s slowdown and is now fully recovered (its p99 back to 40ms). But your API's latency and timeouts haven't recovered — endpoints that call profile still show p99 ~3s and ~8% timeouts 10 minutes later, while endpoints that DON'T call profile are fine. Dashboards: your pool's 'connections in use' is pinned at 50 (100%) and 'wait queue depth' is high; profile-service inbound latency is healthy. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
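One way to turn the dashboard signals into a testable hypothesis is a quick capacity check with Little's Law (concurrency = request rate x time each connection is held). The sketch below is illustrative only: the pool size of 50 and the 40ms healthy latency come from the question, while the 3-second hold time is an assumption (suggested by the p99 of ~3s, for example if connections are being held until a client-side timeout fires). The function name is hypothetical.

```python
def max_throughput(pool_size: int, hold_time_s: float) -> float:
    """Upper bound on requests/s a fixed-size pool can sustain.

    Little's Law: concurrency = rate * hold_time, so with concurrency
    capped at pool_size, rate <= pool_size / hold_time.
    """
    if pool_size <= 0 or hold_time_s <= 0:
        raise ValueError("pool_size and hold_time_s must be positive")
    return pool_size / hold_time_s


POOL_SIZE = 50          # from the question: max 50 connections
HEALTHY_HOLD_S = 0.040  # from the question: profile p99 back to 40ms
STUCK_HOLD_S = 3.0      # ASSUMPTION: connections held ~3s (e.g. until a timeout)

healthy_rps = max_throughput(POOL_SIZE, HEALTHY_HOLD_S)
stuck_rps = max_throughput(POOL_SIZE, STUCK_HOLD_S)

print(f"healthy ceiling: {healthy_rps:.0f} req/s")
print(f"stuck ceiling:   {stuck_rps:.1f} req/s")
print(f"capacity lost:   {healthy_rps / stuck_rps:.0f}x")
```

Under these assumptions the pool's ceiling drops from about 1,250 requests/s to about 17 requests/s, a 75x loss. If real traffic sits between those two numbers, a pool pinned at 100% with a deep wait queue is what you would expect even though the upstream itself is healthy, which points the investigation at how long the client holds each connection rather than at the profile service.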
