On-call · Hard · oc-g455

Subject: Tail latency · Level: Senior–Staff · ~40 min · Common in: Reliability & on-call interviews · Industries: Technology, Software development

Question

A JVM service shows periodic p99 pauses of 200-600 ms. You assume it's GC, but the GC logs show that the actual stop-the-world portion of each collection is tiny (under 10 ms). Yet the total pause time on the GC log line is large, and `-XX:+PrintSafepointStatistics` shows a long time to safepoint (the 'spin', 'block', and 'sync' phases) before collections and other safepoint operations. The service has a few very hot, tight loops doing large-array crunching. p50 is great, CPU is fine, and there is no allocation spike. How do you triage, and what is actually happening?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
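One hypothesis that fits these signals: HotSpot's C2 compiler has historically omitted the safepoint poll from the back edge of counted loops (an `int` induction variable with a constant stride). A thread inside such a loop over a large array cannot reach a safepoint until the loop exits, so every other thread waits and the time to safepoint grows. The sketch below is illustrative only. The class and method names are hypothetical, and whether a given loop actually loses its poll depends on the JDK version, the collector, and flags such as `-XX:+UseCountedLoopSafepoints`.

```java
// Hypothetical illustration; not taken from the service in the question.
final class SafepointLoops {

    // int-indexed counted loop: on older HotSpot builds, C2 may compile this
    // without a safepoint poll on the loop back edge, so a very large array
    // can delay a pending safepoint until the loop finishes.
    static long sumIntCounted(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Same computation with a long induction variable. Older HotSpot versions
    // do not treat this as a counted loop and keep the back-edge poll, at some
    // throughput cost. Newer JDKs may optimize long loops too (assumption:
    // verify on the JDK you actually run).
    static long sumLongCounted(long[] data) {
        long sum = 0;
        for (long i = 0; i < data.length; i++) {
            sum += data[(int) i];
        }
        return sum;
    }
}
```

Both methods return the same result; only the loop shape seen by the JIT differs. To test the hypothesis, run with `-XX:+UseCountedLoopSafepoints` in a canary and compare time to safepoint before and after, rather than rewriting loops first.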
