On-call | Hard | oc-g441

Subject: Lock convoy | Level: Senior–Staff | ~40 min | Common in: Concurrency interviews | Industries: Technology, Software development

Question

A Java/Tomcat service that's been fine for months starts cliffing at p99 every time traffic crosses ~150 concurrent requests: p50 holds at 6ms, p99 jumps from 30ms to 1.2s, and it recovers the instant load drops. CPU sits at ~30% across 16 cores. The dashboard shows context switches spiking to 220k/s (baseline 25k) and a thread dump taken during the spike shows ~120 worker threads all BLOCKED on the same monitor: `synchronized` inside a custom JUL/Logback appender that every request hits twice (request-in, request-out) to write an audit line. The only recent change was raising Tomcat `maxThreads` from 100 to 400 last week to 'absorb spikes.' How do you triage and mitigate?
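The thread-dump evidence in the question (many workers BLOCKED on one monitor) can also be gathered from inside the JVM. The following is a minimal sketch, not the service's actual tooling: it groups BLOCKED threads by the lock they are waiting on, using the standard `java.lang.management` API. The class name `BlockedByMonitor` and the `simulate` helper are hypothetical and exist only to make the example self-contained.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper: a histogram of BLOCKED threads keyed by contended lock.
final class BlockedByMonitor {

    /** Counts BLOCKED threads per lock name ("ClassName@identityHashHex"). */
    static Map<String, Integer> count(ThreadInfo[] infos) {
        Map<String, Integer> counts = new TreeMap<>();
        for (ThreadInfo info : infos) {
            if (info == null) continue; // defensive: skip missing entries
            if (info.getThreadState() == Thread.State.BLOCKED && info.getLockName() != null) {
                counts.merge(info.getLockName(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Takes a live snapshot of all threads in this JVM. */
    static Map<String, Integer> snapshot() {
        return count(ManagementFactory.getThreadMXBean().dumpAllThreads(false, false));
    }

    /** Demo only: parks `waiters` threads on one monitor and reports how many the snapshot sees. */
    static int simulate(int waiters) {
        final Object monitor = new Object();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (monitor) {
                held.countDown();
                try {
                    release.await(); // keep the monitor until the snapshot is taken
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        holder.setDaemon(true);
        Thread[] blocked = new Thread[waiters];
        try {
            holder.start();
            held.await();
            for (int i = 0; i < waiters; i++) {
                blocked[i] = new Thread(() -> { synchronized (monitor) { } });
                blocked[i].setDaemon(true);
                blocked[i].start();
            }
            for (Thread t : blocked) {
                while (t.getState() != Thread.State.BLOCKED) Thread.sleep(1);
            }
            String name = monitor.getClass().getName() + "@"
                    + Integer.toHexString(System.identityHashCode(monitor));
            int seen = snapshot().getOrDefault(name, 0);
            release.countDown();
            holder.join();
            for (Thread t : blocked) t.join();
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // never leave waiters parked
        }
    }

    public static void main(String[] args) {
        System.out.println("BLOCKED on demo monitor: " + simulate(3));
    }
}
```

In a real incident, `jstack` or `jcmd <pid> Thread.print` gives the same information without touching the code. A single lock name dominating the histogram, as in the scenario above, is the signal that points at a convoy rather than at general overload.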

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
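For this scenario, the fastest mitigation is likely to revert `maxThreads` to 100, since that was the only recent change. The durable fix is to take the audit write off the request path. The following is a minimal sketch of that second step under stated assumptions, not the service's real appender: `AsyncAuditWriter` is a hypothetical class in which request threads enqueue lines into a bounded queue without blocking and a single writer thread does the I/O. Dropping lines when the queue is full is an assumption made for brevity; an audit trail may instead need a short timed wait or a spill file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical sketch: request threads hand off audit lines; one thread writes them.
final class AsyncAuditWriter implements AutoCloseable {
    private final BlockingQueue<String> queue;
    private final Consumer<String> sink;      // the slow part (file, socket, ...)
    private final AtomicLong dropped = new AtomicLong();
    private final Thread writer;
    private volatile boolean running = true;

    AsyncAuditWriter(int capacity, Consumer<String> sink) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.sink = sink;
        this.writer = new Thread(this::drain, "audit-writer");
        this.writer.setDaemon(true);
        this.writer.start();
    }

    /** Called on request threads. Never waits on I/O; returns false if the line was dropped. */
    boolean append(String line) {
        if (!running) return false;
        boolean accepted = queue.offer(line); // non-blocking enqueue
        if (!accepted) dropped.incrementAndGet();
        return accepted;
    }

    long droppedCount() {
        return dropped.get();
    }

    private void drain() {
        List<String> batch = new ArrayList<>();
        try {
            // Keep going after close() until the queue is empty.
            while (running || !queue.isEmpty()) {
                String first = queue.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) continue;
                batch.add(first);
                queue.drainTo(batch);         // batch whatever else is waiting
                for (String l : batch) sink.accept(l);
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stops accepting lines and waits for the writer to flush what is queued. */
    @Override
    public void close() {
        running = false;
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The queue still has an internal lock, but it is held only long enough to enqueue a reference, not for the duration of formatting and I/O, so 400 workers no longer pile up behind one slow writer. If the appender is Logback-based, wrapping it in Logback's built-in `AsyncAppender` is the off-the-shelf equivalent. Whichever route is taken, exporting the drop count as a metric keeps the trade-off visible.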
