Question
A Java service backed by a fixed 200-thread request pool starts returning 503s ('pool exhausted / task rejected') at 15:20; p99 on *every* endpoint balloons to the 30s timeout, even endpoints that don't touch the database. CPU sits at 20% and memory is flat — the box is nearly idle. A thread dump shows ~195 of 200 worker threads BLOCKED/parked in a synchronous call to one downstream recommendations API. That recommendations API got slow (its own incident) starting at 15:15. Throughput on your service has collapsed even though nothing about your service changed. Triage and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.