Code Room
On-call · Medium
Question
Your Java service, which aggregates data from three downstream APIs, starts returning 503s and throwing `RejectedExecutionException` under normal traffic. The dashboards show the request thread pool at 100% busy with a long queue, but service CPU is only 45% and GC looks healthy. The p99 latency of one of the three downstream APIs (the 'pricing' service) has quietly climbed from 50ms to 2.5s over the last 20 minutes. Your service calls all three downstreams synchronously on the same thread pool. How do you triage and stop your service from falling over?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
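One way to "stop the bleeding" in this scenario is to keep the slow pricing calls from occupying every request thread. The sketch below is illustrative only: it gives pricing its own small bounded pool (a bulkhead), caps how long a caller waits, and returns a fallback when the call is slow, fails, or is rejected. The class name `PricingBulkhead`, the pool sizes, the 200 ms timeout, and the fallback string are all assumptions made for this example, not details from the incident, and `completeOnTimeout` requires Java 9 or later.

```java
import java.time.Duration;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical bulkhead: isolates calls to one slow downstream on a dedicated pool. */
public class PricingBulkhead {
    private final ThreadPoolExecutor pool;
    private final Duration timeout;

    public PricingBulkhead(int threads, int queueCapacity, Duration timeout) {
        // Fixed-size pool with a bounded queue; AbortPolicy rejects work
        // instead of letting a backlog grow without limit.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
        this.timeout = timeout;
    }

    /** Runs the call on the dedicated pool; returns the fallback on timeout, error, or rejection. */
    public <T> T call(Supplier<T> pricingCall, T fallback) {
        try {
            return CompletableFuture.supplyAsync(pricingCall, pool)
                    // The caller stops waiting after the timeout. The worker thread is NOT
                    // interrupted, so the real HTTP client still needs its own timeouts.
                    .completeOnTimeout(fallback, timeout.toMillis(), TimeUnit.MILLISECONDS)
                    .exceptionally(err -> fallback)
                    .join();
        } catch (RejectedExecutionException e) {
            // Bulkhead is full: fail fast rather than queueing behind slow calls.
            return fallback;
        }
    }

    public void shutdown() {
        pool.shutdownNow();
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        PricingBulkhead bulkhead = new PricingBulkhead(2, 2, Duration.ofMillis(200));

        // Healthy call: completes well inside the timeout.
        System.out.println(bulkhead.call(() -> "price=42", "price=unavailable"));

        // Degraded call: simulates the 2.5s p99 from the scenario.
        System.out.println(bulkhead.call(() -> {
            sleepQuietly(2500);
            return "price=42";
        }, "price=unavailable"));

        bulkhead.shutdown();
    }
}
```

In this sketch the second call should return the fallback after roughly 200 ms instead of holding the request thread for 2.5 s. The request thread still blocks on `join()`, but only up to the timeout, and at most `threads + queueCapacity` pricing calls can be in flight, so the other two downstreams keep their capacity. Whether a degraded response without pricing is acceptable is a product decision that the answer should state explicitly.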