Question
Your API service calls an upstream profile service through a fixed-size HTTP connection pool (max 50 connections). At 17:00 the profile service had a brief 30s slowdown and is now fully recovered (its p99 back to 40ms). But your API's latency and timeouts haven't recovered — endpoints that call profile still show p99 ~3s and ~8% timeouts 10 minutes later, while endpoints that DON'T call profile are fine. Dashboards: your pool's 'connections in use' is pinned at 50 (100%) and 'wait queue depth' is high; profile-service inbound latency is healthy. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
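As one illustration of a mitigation that also helps prevent a repeat, here is a minimal sketch of a pool wrapper that sheds load instead of letting callers queue indefinitely. It assumes an asyncio-based service. `BoundedPool`, `PoolSaturated`, and both timeout parameters are hypothetical names for illustration, not the API of any real HTTP client. The idea is that a pinned pool with a deep wait queue can keep itself saturated even after the upstream recovers, for example if waiters pile up faster than slots free, or if timed-out requests fail to return their slot. Bounding both the wait for a slot and the call itself is one way to let the queue drain.

```python
import asyncio


class PoolSaturated(Exception):
    """Raised when no connection slot frees up within the acquire budget."""


class BoundedPool:
    """Hypothetical stand-in for a fixed-size connection pool (not a real library API)."""

    def __init__(self, max_connections: int, acquire_timeout_s: float) -> None:
        # One semaphore permit per connection slot (e.g. 50 in the scenario above).
        self._slots = asyncio.Semaphore(max_connections)
        self._acquire_timeout_s = acquire_timeout_s

    async def call(self, request_fn, call_timeout_s: float):
        # Fail fast rather than joining an unbounded wait queue.
        try:
            await asyncio.wait_for(self._slots.acquire(), self._acquire_timeout_s)
        except asyncio.TimeoutError:
            raise PoolSaturated("no free slot within acquire budget") from None
        try:
            # Bound the upstream call as well, so a stuck request cannot hold a slot forever.
            return await asyncio.wait_for(request_fn(), call_timeout_s)
        finally:
            # Always return the slot, including on timeout or error.
            self._slots.release()
```

A caller would create the pool inside the running event loop and treat `PoolSaturated` as a signal to return a fast error or a degraded response (for instance, a cached profile) rather than retrying immediately, since immediate retries add to the same queue.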