Code Room
On-call · Hard
Question
A JVM service shows periodic p99 pauses of 200-600 ms. You assume it's GC, but the GC logs show that the actual stop-the-world collection work is tiny (under 10 ms), yet the total pause time reported on each GC log line is large. `-XX:+PrintSafepointStatistics` shows a long time to safepoint (the 'spin', 'block', and 'sync' phases) before collections and other safepoint operations. The service has a few very hot, tight loops doing heavy array crunching. p50 is great, CPU is fine, and there is no allocation spike. How do you triage, and what is actually happening?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
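For the scenario in the question, one hypothesis worth testing is that the hot loops themselves delay safepoints. The Java sketch below is illustrative only: the class and method names are hypothetical, and whether a given loop actually omits safepoint polls depends on the JVM version, the JIT compiler, and the flags in use. It contrasts an int-indexed counted loop, which HotSpot's C2 compiler has historically been able to compile without a safepoint poll in the loop body, with a long-indexed variant that older HotSpot versions do not treat as a counted loop and therefore keep a poll on the back-edge.

```java
// Hypothetical sketch: two loops that compute the same sum but may differ
// in how quickly the thread can reach a safepoint once compiled by the JIT.
final class SafepointLoopSketch {

    // int-indexed counted loop. Assumption: on some HotSpot versions C2
    // compiles this without a safepoint poll inside the loop, so a very
    // long run can stretch the "time to safepoint" for every other thread.
    static long sumIntIndexed(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // long-indexed variant. Assumption: older HotSpot versions do not
    // recognise this as a counted loop and keep a poll on the back-edge,
    // trading some throughput for faster safepoint arrival.
    static long sumLongIndexed(long[] data) {
        long sum = 0;
        for (long i = 0; i < data.length; i++) {
            sum += data[(int) i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        // Both variants must agree; only their safepoint behaviour may differ.
        System.out.println(sumIntIndexed(data) == sumLongIndexed(data));
    }
}
```

If the hypothesis holds, candidate mitigations include running with `-XX:+UseCountedLoopSafepoints` or splitting the array work into smaller chunks. Confirm the effect against the safepoint statistics rather than assuming it.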