Code Room
On-call · Medium
Question
After a debugging session left the log level at DEBUG in production, your service's p99 latency jumped from 20 ms to 300 ms under load while p50 barely moved. Flame graphs show request threads spending significant time inside synchronous logging calls, and disk write latency on the log volume is elevated and bursty. CPU usage is moderate. Logging is configured to be synchronous, and the appender calls fsync. Throughput is down. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
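To make the "mitigate first" step concrete, here is a minimal sketch of one possible mitigation. It assumes the service uses Python's standard `logging` module, which the question does not state; the same idea applies to async appenders in other frameworks. It restores the level to INFO and moves the slow, fsyncing handler off the request path behind a bounded queue. The names `DropOnFullQueueHandler` and `configure_async_logging` and the default capacity are hypothetical, chosen only for illustration.

```python
import logging
import logging.handlers
import queue


class DropOnFullQueueHandler(logging.handlers.QueueHandler):
    """Hypothetical handler: counts dropped records instead of blocking
    the request thread when the bounded queue is full."""

    def __init__(self, log_queue):
        super().__init__(log_queue)
        self.dropped = 0

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            # Shed log load rather than add latency to the request.
            self.dropped += 1


def configure_async_logging(logger, slow_handler, level=logging.INFO, capacity=10_000):
    """Route `logger` through a bounded queue so that `slow_handler`
    (for example, a file handler on the bursty log volume) runs on a
    background thread instead of on request threads."""
    log_queue = queue.Queue(maxsize=capacity)
    queue_handler = DropOnFullQueueHandler(log_queue)

    # Replace any handlers that write synchronously on the request path.
    for existing in list(logger.handlers):
        logger.removeHandler(existing)
    logger.addHandler(queue_handler)

    # Mitigation step one: undo the leftover DEBUG level.
    logger.setLevel(level)

    # The listener thread drains the queue into the slow handler.
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()
    return listener, queue_handler
```

The trade-off in this sketch is deliberate: under overload it drops log records (and counts them in `dropped`) instead of stalling requests, and records still in the queue are lost if the process crashes. Whether that is acceptable depends on what the logs are used for, so audit or billing logs may need a different path. Call `listener.stop()` at shutdown to flush the queue.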