Code Room
On-call · Medium · oc-g097
Subject: Capacity incidents · Level: Mid–Senior · ~35 min · Common in: Concurrency, Reliability & on-call, Distributed systems interviews · Industries: Technology

Question

Your Java service, which aggregates data from three downstream APIs, starts returning 503s and throwing `RejectedExecutionException` under normal traffic. The dashboards show the request thread pool at 100% busy with a long queue, but service CPU is only 45% and GC looks healthy. One of the three downstream APIs (the 'pricing' service) has seen its p99 latency climb quietly from 50 ms to 2.5 s over the last 20 minutes. Your service calls all three downstreams synchronously on the same thread pool. How do you triage and stop your service from falling over?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
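One possible mitigation for the scenario in the question is to stop the slow pricing calls from holding shared request threads: give each downstream its own small, bounded pool (a bulkhead), put a time budget on every call, and fall back to a degraded value when the budget is exceeded or the pool is full. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation. The class name `BulkheadSketch`, the helpers `bulkhead`, `callWithBudget`, and `slow`, and all pool sizes and timeouts are hypothetical; `completeOnTimeout` requires Java 9 or later.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class BulkheadSketch {

    // Hypothetical bulkhead: a fixed-size pool with a bounded queue, one per downstream.
    // AbortPolicy makes a full pool fail fast instead of queueing without limit.
    static ExecutorService bulkhead(int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy());
    }

    // Runs the call on the given bulkhead and returns the fallback if the call
    // fails, exceeds timeoutMs, or is rejected because the bulkhead is saturated.
    // Note: a timed-out task is not interrupted; it keeps its bulkhead thread
    // until the call itself returns, so the client still needs its own
    // connect/read timeouts. The bulkhead only bounds how many threads it can hold.
    static <T> CompletableFuture<T> callWithBudget(
            Supplier<T> call, ExecutorService pool, long timeoutMs, T fallback) {
        try {
            return CompletableFuture.supplyAsync(call, pool)
                    .completeOnTimeout(fallback, timeoutMs, TimeUnit.MILLISECONDS)
                    .exceptionally(ex -> fallback);
        } catch (RejectedExecutionException e) {
            // Pool and queue are full: shed the call immediately.
            return CompletableFuture.completedFuture(fallback);
        }
    }

    // Hypothetical stand-in for a downstream client call that takes delayMs.
    static Supplier<String> slow(long delayMs, String value) {
        return () -> {
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return value;
        };
    }

    public static void main(String[] args) {
        ExecutorService pricingPool = bulkhead(4, 8);   // assumed sizes
        ExecutorService catalogPool = bulkhead(4, 8);

        // Pricing is degraded (simulated 500 ms) and gets a 100 ms budget;
        // the other downstream is healthy and unaffected by pricing's pool.
        CompletableFuture<String> price =
                callWithBudget(slow(500, "live-price"), pricingPool, 100, "cached-price");
        CompletableFuture<String> item =
                callWithBudget(slow(10, "live-item"), catalogPool, 100, "cached-item");

        System.out.println(price.join() + " / " + item.join());

        pricingPool.shutdownNow();
        catalogPool.shutdownNow();
    }
}
```

The design choice here is isolation over fairness: a saturated pricing bulkhead rejects or times out its own calls while the other two downstreams keep their threads. Whether a cached or default price is an acceptable fallback is a product decision that the sketch does not settle.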
