On-call · Hard · oc-g296

Subject: Dependency outages
Level: Senior–Staff
Time: ~40 min
Common in: Reliability & on-call · Code quality & review interviews
Industries: Technology

Question

Your order service calls a downstream inventory service that is slow but NOT down: its p99 crept from 60ms to 1.2s at 14:00 and has held there. Within ten minutes your order service, which makes many calls that don't even touch inventory, is failing broadly: its p99 has risen to 8s, with 30% errors on endpoints that have nothing to do with inventory. The dashboards show that inventory itself is at a steady 1.2s and not getting worse, that your order service's CPU is low, and that its outbound connection pool to inventory is pegged at 100% utilization with a long wait queue. There has been no deploy. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
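
One plausible reading of the signals in the question: at a fixed request rate, the number of in-flight calls scales with latency (Little's law), so a p99 move from 60ms to 1.2s can demand roughly 20x the concurrency on the inventory pool. Request threads then sit in the pool's wait queue, which would explain low CPU alongside failures on endpoints that never touch inventory. A common mitigation for that pattern is a bulkhead that caps concurrent inventory calls and fails fast instead of queueing. The sketch below is illustrative only: `Bulkhead`, `BulkheadFull`, `reserve_stock`, the limit of 2, and the fallback are hypothetical names and values, not part of the scenario, and it assumes `call_inventory` already enforces its own request timeout.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when the dependency's concurrency budget is exhausted."""


class Bulkhead:
    """Caps in-flight calls to one dependency and sheds load instead of queueing."""

    def __init__(self, max_concurrent: int, max_wait_s: float = 0.0):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._max_wait_s = max_wait_s

    @contextmanager
    def slot(self):
        # Wait at most max_wait_s for a free slot; never queue indefinitely.
        if self._max_wait_s > 0:
            acquired = self._slots.acquire(timeout=self._max_wait_s)
        else:
            acquired = self._slots.acquire(blocking=False)
        if not acquired:
            raise BulkheadFull("inventory bulkhead full; failing fast")
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical limit; in practice it would be sized from normal traffic.
inventory_bulkhead = Bulkhead(max_concurrent=2)


def reserve_stock(call_inventory, fallback):
    """Call inventory inside the bulkhead; degrade via fallback when it is full."""
    try:
        with inventory_bulkhead.slot():
            return call_inventory()
    except BulkheadFull:
        return fallback()
```

The design choice here is to reject immediately rather than wait: a rejected inventory call costs one degraded response, while a queued one holds a request thread that other endpoints need. Whether a fallback is acceptable for orders (for example, accepting the order and reconciling stock later) is a product decision the sketch does not settle.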

Learn the concepts

Diagram & narrate the incident

