Question
Your order service calls a downstream inventory service that is slow but NOT down: its p99 crept from 60ms to 1.2s at 14:00 and has held there. Within ten minutes your order service, which handles many requests that don't even touch inventory, is failing broadly: its p99 is up to 8s, with 30% errors on endpoints that have nothing to do with inventory. Dashboards show inventory itself at a steady 1.2s, not getting worse; your order service's CPU is low; its outbound connection pool to inventory is pegged at 100% utilization with a long wait queue. There has been no deploy. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
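
One way to sanity-check the pool signal in the scenario is Little's Law: average in-flight calls equal arrival rate times time spent per call. The sketch below is illustrative only. The call rate and pool size are hypothetical (the scenario does not state them), and it treats the quoted p99 figures as a rough stand-in for typical latency, which is an assumption. What does follow from the scenario's own numbers is the ratio: 1.2s / 60ms means each call holds its connection about 20 times longer at the same traffic.

```python
def required_connections(calls_per_second: float, latency_seconds: float) -> float:
    """Little's Law: average in-flight calls = arrival rate x time per call."""
    return calls_per_second * latency_seconds


# Hypothetical values for illustration; neither appears in the scenario.
CALLS_PER_SECOND = 50.0
POOL_SIZE = 20

# Latency figures from the scenario, used here as a proxy for typical latency.
before = required_connections(CALLS_PER_SECOND, 0.060)  # 60 ms
after = required_connections(CALLS_PER_SECOND, 1.2)     # 1.2 s

print(f"in-flight before: {before:.1f}, after: {after:.1f}, pool size: {POOL_SIZE}")
print(f"growth factor: {after / before:.0f}x")
print("pool saturated" if after > POOL_SIZE else "pool has headroom")
```

Under these assumed numbers, a pool sized comfortably for the old latency cannot cover the new demand, so callers queue for a connection while CPU stays idle. That pattern would be consistent with the low CPU, the pegged pool, and the long wait queue described above.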