Code Room
On-call · Medium
Question
Over the past 3 days, one Java service has slowly gotten worse: latency creeps up daily and one pod per day needs a restart to recover, but it never fully 'fails'. Dashboards: heap is fine and GC is normal, but the JVM thread count climbs steadily from ~200 to ~2000 over ~20 hours before that pod's latency degrades; a downstream HTTP client is created per-request and its connections sometimes don't close; restarting the pod resets thread count and latency. No deploy in the last week. How do you triage and mitigate?
What a strong answer looks like
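Stop the bleeding first (mitigate), then form hypotheses from real signals rather than guesses. Separate the symptom (thread count climbing until latency degrades) from the likely root cause (an HTTP client created per request whose connections are not always closed). Communicate status as you go, and close with what prevents a repeat.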
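One way to turn "thread count is climbing" into a usable signal is to group live threads by name pattern and see which group is growing. The sketch below is a minimal illustration, not the service's actual tooling: the class `ThreadLeakTriage` and its methods are hypothetical names, and it assumes the leaking threads share a name that differs only in numeric suffixes. On a live pod the same grouping could be applied to the thread names in a dump taken with `jstack` or `jcmd <pid> Thread.print`.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper: groups thread names so a leaking pool stands out.
class ThreadLeakTriage {

    // Collapse every run of digits to "N" so that, for example,
    // "pool-12-thread-3" and "pool-7-thread-1" land in the same group.
    static String normalize(String threadName) {
        return threadName.replaceAll("\\d+", "N");
    }

    // Count threads per normalized name pattern.
    static Map<String, Long> countByPattern(Collection<String> threadNames) {
        return threadNames.stream()
                .collect(Collectors.groupingBy(
                        ThreadLeakTriage::normalize,
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        // Snapshot the names of all live threads in this JVM.
        List<String> live = Thread.getAllStackTraces().keySet().stream()
                .map(Thread::getName)
                .collect(Collectors.toList());

        // Print the ten largest groups, biggest first.
        countByPattern(live).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```

Comparing two snapshots taken some hours apart shows which pattern accounts for the growth. If the growing group belongs to the per-request HTTP client, that supports the hypothesis that each unclosed client leaves threads behind. Sharing one client instance across requests would then be the natural fix to evaluate, though the details depend on the library in use.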