Code Room
On-call · Medium
Question
The Java orders API (Tomcat, 200-thread pool) starts returning 503s, and the load balancer marks half the fleet unhealthy. Dashboards: the active-threads gauge is pinned at 200 on every pod and request queue depth is climbing, but app CPU is only 25% and GC is calm. The thread dump shows ~190 threads blocked in `socketRead0` on calls to the third-party tax-calculation API. That tax vendor posted a status-page notice 10 minutes ago: "elevated latency." No deploy today. How do you respond?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
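One way to make the "prevents a repeat" part concrete is a guard around the vendor call that caps how many threads may wait on it and gives up after a deadline. The sketch below is only an illustration under assumptions: `VendorCallGuard`, its parameters, and the fallback value are hypothetical names, not part of the orders API, and a production service would more likely use real socket connect/read timeouts plus an established resilience library.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical guard for a slow dependency: bounded concurrency plus a caller-side deadline. */
final class VendorCallGuard<T> {
    private final Semaphore permits;
    private final ExecutorService pool;
    private final long timeoutMillis;

    VendorCallGuard(int maxConcurrent, long timeoutMillis) {
        this.permits = new Semaphore(maxConcurrent);
        // Daemon workers so stuck vendor calls never keep the JVM alive.
        this.pool = Executors.newFixedThreadPool(maxConcurrent, r -> {
            Thread t = new Thread(r, "vendor-call");
            t.setDaemon(true);
            return t;
        });
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns the vendor result, or the fallback if the bulkhead is full or the deadline passes. */
    T call(Callable<T> vendorCall, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // bulkhead full: fail fast instead of tying up a request thread
        }
        Future<T> future = pool.submit(() -> {
            try {
                return vendorCall.call();
            } finally {
                // The permit is held until the vendor call really ends,
                // so stuck calls can never exceed maxConcurrent.
                permits.release();
            }
        });
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return fallback; // the request thread is freed; the worker keeps waiting
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the vendor call itself failed
        }
    }
}
```

With a guard like this, a limit of, say, 20 concurrent vendor calls would leave the other 180 Tomcat threads free to serve requests that do not need tax calculation. Whether a fallback is acceptable for tax (for example, deferring the calculation) is a business decision the sketch does not settle, and the caller-side deadline does not replace socket-level timeouts on the HTTP client itself.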