Question
A Java/Tomcat service that's been fine for months starts cliffing at p99 every time traffic crosses ~150 concurrent requests: p50 holds at 6ms, p99 jumps from 30ms to 1.2s, and it recovers the instant load drops. CPU sits at ~30% across 16 cores. The dashboard shows context switches spiking to 220k/s (baseline 25k) and a thread dump taken during the spike shows ~120 worker threads all BLOCKED on the same monitor: `synchronized` inside a custom JUL/Logback appender that every request hits twice (request-in, request-out) to write an audit line. The only recent change was raising Tomcat `maxThreads` from 100 to 400 last week to 'absorb spikes.' How do you triage and mitigate?
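Stop the bleeding first: mitigate before you diagnose further. Here the cheapest reversible step is rolling Tomcat `maxThreads` back from 400 to 100, since that is the only recent change. Then form hypotheses from the signals you actually have: CPU at ~30% with context switches at 220k/s and ~120 threads BLOCKED on one monitor points to lock contention, not CPU saturation. Separate root cause from symptom: the p99 cliff is the symptom, the `synchronized` appender on the request path is the likely root cause, and the `maxThreads` increase is the trigger that exposed it. Communicate status as you go, and close with what prevents a repeat.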
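
For the longer-term fix, one option is to take the shared monitor off the request path: request threads hand audit lines to a bounded queue, and a single background thread does the I/O. The sketch below is a minimal illustration, assuming audit lines are plain strings and that dropping a line when the queue is full is acceptable. `AsyncAuditWriter`, `offer`, and `droppedCount` are hypothetical names, not part of Logback or JUL. If the service is on Logback, its built-in `AsyncAppender` covers the same ground and is probably the better starting point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical helper: moves audit I/O off request threads. */
public final class AsyncAuditWriter implements AutoCloseable {
    private final BlockingQueue<String> queue;
    // Performs the real I/O; only ever called from the drain thread.
    private final Consumer<List<String>> sink;
    private final AtomicLong dropped = new AtomicLong();
    private final Thread drainer;
    private volatile boolean running = true;

    public AsyncAuditWriter(int capacity, Consumer<List<String>> sink) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.sink = sink;
        this.drainer = new Thread(this::drainLoop, "audit-drainer");
        this.drainer.setDaemon(true);
        this.drainer.start();
    }

    /**
     * Called from request threads. Never waits on I/O: the queue's
     * internal lock is held only for the enqueue itself.
     * Returns false if the queue was full and the line was dropped.
     */
    public boolean offer(String line) {
        if (queue.offer(line)) {
            return true;
        }
        dropped.incrementAndGet();
        return false;
    }

    /** Number of lines lost to a full queue or a failing sink. */
    public long droppedCount() {
        return dropped.get();
    }

    private void drainLoop() {
        try {
            // Keep draining after close() until the queue is empty.
            while (running || !queue.isEmpty()) {
                String first = queue.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue;
                }
                List<String> batch = new ArrayList<>();
                batch.add(first);
                queue.drainTo(batch, 255); // batch up to 256 lines per write
                try {
                    sink.accept(batch);
                } catch (RuntimeException e) {
                    // A failing sink must not kill the drain thread.
                    dropped.addAndGet(batch.size());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stops accepting guarantees for new lines and flushes what is queued. */
    @Override
    public void close() {
        running = false;
        try {
            drainer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Request threads would call `offer(line)` at request-in and request-out and return immediately; `droppedCount()` is the number to alert on. If audit lines must never be lost, the non-blocking `offer` could be swapped for the timed `BlockingQueue.offer(e, timeout, unit)`, treating a timeout as a request failure rather than a silent drop.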