On-call | Medium | oc-g261

Subject: Tail latency | Level: Mid–Senior | ~30 min | Common in: Reliability & on-call interviews | Industries: Technology, Software development

Question

After a debugging session left the log level at DEBUG in prod, your service's p99 latency jumped from 20ms to 300ms under load while p50 barely moved. Flame graphs show request threads spending significant time inside synchronous logging calls, and disk write latency on the log volume is elevated and bursty. CPU is moderate. Logging is configured as synchronous, and the appender fsyncs. Throughput is down. How do you triage and mitigate?
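To see why this pattern points at intermittent stalls rather than uniformly slower code, a rough simulation can help. The sketch below is illustrative only: the function names, the 10-15ms baseline, the 300ms stall, and the 2% stall rate are all assumptions chosen to resemble the scenario, not measurements from it.

```python
import random

def simulate(n_requests, stall_fraction, base_ms=10.0, stall_ms=300.0, seed=0):
    """Return per-request latencies in ms (all parameters are hypothetical).

    Every request pays a small baseline cost; a fraction of requests
    additionally block on a slow log write.
    """
    rng = random.Random(seed)
    latencies = []
    for _ in range(n_requests):
        latency = base_ms + rng.uniform(0.0, 5.0)
        if rng.random() < stall_fraction:
            latency += stall_ms  # request thread waits on the slow disk
        latencies.append(latency)
    return latencies

def percentile(values, pct):
    """Simple rank-based percentile over a sorted copy of the values."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]

if __name__ == "__main__":
    for label, fraction in (("healthy", 0.0), ("bursty stalls", 0.02)):
        data = simulate(10_000, fraction)
        print(label, "p50:", round(percentile(data, 50), 1),
              "p99:", round(percentile(data, 99), 1))
```

When only a small share of requests hit a stall, the median stays near the baseline while the 99th percentile lands inside the stalled group, which matches the p50/p99 split described in the question.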

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
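As one possible shape for the mitigation, the sketch below uses Python's standard `logging` module as a stand-in; the question does not say which logging framework the service uses, so treat this as an analogy rather than the actual fix. It raises the level back to INFO and moves the slow write off the request thread through a bounded queue that drops, and counts, records when full. `DroppingQueueHandler` and `install_async_logging` are hypothetical names introduced here.

```python
import logging
import logging.handlers
import queue

class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Enqueue records without blocking; count drops when the queue is full."""

    def __init__(self, log_queue):
        super().__init__(log_queue)
        self.dropped = 0

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            # Prefer losing a log line over stalling a request thread.
            self.dropped += 1

def install_async_logging(logger, slow_handler, level=logging.INFO,
                          max_queued=10_000):
    """Route `logger` through a bounded queue drained by a background thread.

    `slow_handler` is whatever handler performs the blocking disk write.
    Returns the queue handler (for its drop counter) and the listener
    (call listener.stop() on shutdown to flush queued records).
    """
    log_queue = queue.Queue(maxsize=max_queued)
    queue_handler = DroppingQueueHandler(log_queue)
    listener = logging.handlers.QueueListener(
        log_queue, slow_handler, respect_handler_level=True)
    for handler in list(logger.handlers):
        logger.removeHandler(handler)  # detach the synchronous handlers
    logger.addHandler(queue_handler)
    logger.setLevel(level)  # step 1 of the mitigation: back off from DEBUG
    listener.start()
    return queue_handler, listener
```

Restoring the log level is the fast, low-risk step and can usually be done first; the async queue is the follow-up that keeps a future verbose-logging mistake from blocking request threads. The trade-off of dropping on overflow is lost log lines under pressure, so the drop counter should be exported as a metric.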


Diagram & narrate the incident
