On-call · Hard · oc-g202

Subject: Cascading failure · Level: Senior–Staff · ~40 min · Common in: Reliability & on-call interviews · Industries: Technology, Software development

Question

Your monolith calls 6 microservices through a shared HTTP client connection pool (max 200 connections). At 19:40 the whole app's p99 latency climbs to 30s and the error rate spikes across ALL endpoints, even ones unrelated to recommendations. Dashboards show: the recommendations service (one of the 6) has had a p99 of 28s since a deploy at 19:35; the shared pool has 200/200 connections in use with a growing wait queue; the monolith's thread pool is saturated. The other 5 downstreams are healthy. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
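For the "prevents a repeat" part, one commonly discussed option is a per-dependency bulkhead: cap how many in-flight calls any single downstream may hold, and fail fast to a fallback once that cap is reached, so one slow service cannot occupy all 200 shared connections. As rough arithmetic, if calls to a slow dependency hold a connection for close to 28s each, a 200-connection pool sustains only about 200 / 28 ≈ 7 such calls per second before it is exhausted. The sketch below is a minimal illustration in Python, not a reference implementation. The names `Bulkhead`, `BulkheadFull`, and `call_with_fallback` are hypothetical, and the sketch assumes that `fetch` is a callable that enforces its own request timeout.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a dependency has used up its concurrency budget."""


class Bulkhead:
    """Caps concurrent calls to one downstream (hypothetical helper)."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject immediately instead of queueing
        # behind a slow dependency and tying up a request thread.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            yield
        finally:
            self._slots.release()


def call_with_fallback(bulkhead, fetch, fallback):
    """Run fetch() inside the bulkhead; degrade to fallback on any failure.

    Assumption: fetch() applies its own timeout, so a held slot is
    always released within a bounded time.
    """
    try:
        with bulkhead.slot():
            return fetch()
    except Exception:
        # Covers a full bulkhead as well as timeouts or errors from fetch().
        return fallback
```

A caller might wrap the recommendations request as `call_with_fallback(recs_bulkhead, fetch_recs, [])`, where the budget given to `recs_bulkhead` is a tuning choice well below the pool size. The page then renders without recommendations instead of stalling every endpoint.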
