On-call · Hard · oc-g443

Subject: P99 regression
Level: Senior–Staff
Time: ~40 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

After a deploy at 14:00, a long-running JVM service shows p99 latency on a hot endpoint regressing from 25 ms to 110 ms, but NOT at startup. The pods warm up fine and run clean for roughly 20 to 40 minutes, then each pod's p99 steps up to the new bad level and stays there for the rest of its life. p50 is unchanged. CPU is flat, GC is quiet, the allocation rate has not changed, and downstreams are fast. The change in this deploy added a new branch to a very hot method to handle a rare new input type. JIT compilation logs (`-XX:+PrintCompilation`) show the hot method being marked "made not entrant" (deoptimized) and recompiled around the time each pod degrades, and the recompiled version never returns to the tier it ran at before. How do you triage and explain this?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
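One way to turn the `-XX:+PrintCompilation` signal from the question into evidence is to pull out every "made not entrant" line for the hot method and line its timestamp up against the moment the pod's p99 stepped. The sketch below is a minimal illustration, not a definitive tool: the class `DeoptScan`, the method name `com.example.Handler::handle`, and the sample log lines are all hypothetical, and it assumes tiered compilation is on (the JVM default), so each line reads as uptime in ms, compile id, optional flags, tier, then the method name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds invalidations of one method in -XX:+PrintCompilation output.
final class DeoptScan {
    // One invalidation: JVM uptime in ms and the tier of the code that was discarded.
    record Event(long uptimeMs, int tier) {}

    // Made-up sample lines in the assumed column layout (not taken from a real incident).
    static final List<String> SAMPLE = List.of(
        "    812  145       3       com.example.Handler::handle (180 bytes)",
        "   2950  412       4       com.example.Handler::handle (180 bytes)",
        "   2951  145       3       com.example.Handler::handle (180 bytes)   made not entrant",
        "1500321  412       4       com.example.Handler::handle (180 bytes)   made not entrant",
        "1500340  977       3       com.example.Handler::handle (180 bytes)");

    // Returns every "made not entrant" event for the given Class::method, in log order.
    static List<Event> invalidations(List<String> logLines, String method) {
        List<Event> events = new ArrayList<>();
        for (String line : logLines) {
            if (!line.contains("made not entrant")) continue;
            String[] tok = line.trim().split("\\s+");
            for (int i = 2; i < tok.length; i++) {
                if (tok[i].equals(method)) {
                    // tok[0] is uptime in ms; the token just before the method name is the tier.
                    events.add(new Event(Long.parseLong(tok[0]), Integer.parseInt(tok[i - 1])));
                    break;
                }
            }
        }
        return events;
    }

    public static void main(String[] args) {
        for (Event e : invalidations(SAMPLE, "com.example.Handler::handle")) {
            System.out.println(e);
        }
    }
}
```

An early tier-3 invalidation is normal warm-up, because the tier-4 code replaces it. The event worth correlating with the latency step is a tier-4 invalidation tens of minutes into the pod's life that is not followed by a new tier-4 compile of the same method.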
