On-call · Medium · oc-g219

Subject: Gray failure
Level: Mid–Senior
~35 min
Common in: Concurrency interviews
Industries: Technology, Software development

Question

Over the past 3 days, one Java service has slowly gotten worse: latency creeps up daily and one pod per day needs a restart to recover, but it never fully 'fails'. Dashboards: heap is fine and GC is normal, but the JVM thread count climbs steadily from ~200 to ~2000 over ~20 hours before that pod's latency degrades; a downstream HTTP client is created per-request and its connections sometimes don't close; restarting the pod resets thread count and latency. No deploy in the last week. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
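One way to "form hypotheses from real signals" here is to check which thread family is actually growing, since the dashboard only shows the total count. The sketch below is a minimal illustration, not the service's real tooling: the class `ThreadTriage` and its methods are hypothetical names. It uses only standard JDK calls, and collapses the digits in each thread name so that threads from the same pool or client fall under one key.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical triage helper: shows which family of threads dominates the JVM. */
final class ThreadTriage {

    /** Replaces every digit run with "N", e.g. "pool-7-thread-3" becomes "pool-N-thread-N". */
    static String patternOf(String threadName) {
        return threadName.replaceAll("\\d+", "N");
    }

    /** Counts thread names per collapsed pattern, keyed alphabetically. */
    static Map<String, Long> countByPattern(Collection<String> threadNames) {
        return threadNames.stream()
                .collect(Collectors.groupingBy(
                        ThreadTriage::patternOf, TreeMap::new, Collectors.counting()));
    }

    /** Names of all threads currently alive in this JVM. */
    static List<String> liveThreadNames() {
        return Thread.getAllStackTraces().keySet().stream()
                .map(Thread::getName)
                .collect(Collectors.toList());
    }

    /** Prints the ten most common thread-name patterns, largest first. */
    public static void main(String[] args) {
        countByPattern(liveThreadNames()).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```

Run against the degraded pod (or applied to the names in a thread dump), a single pattern accounting for most of the ~2000 threads would support the per-request client as the leak. A flat distribution would point elsewhere. Comparing the output with a freshly restarted pod separates the normal baseline from the growth.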

