Question
Your fraud-decision service calls a third-party device-reputation vendor synchronously on every login. At 16:10 your login p99 climbs from 220ms to 2.4s and your thread pool saturation alarm fires, but error rate is flat at baseline and the vendor's status page is green. Dashboards: vendor call success is 100% (all HTTP 200), but the vendor's response-time histogram shifted — median moved from 40ms to 900ms with a fat tail to ~3s; no 5xx, no timeouts (your client timeout is 5s). Your own CPU and DB are healthy. No deploy on your side. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.