On-call · Hard · oc-g246

Subject: P99 regression
Level: Senior–Staff
Time: ~40 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

Your service's dashboards report a healthy p99 of 25ms, and the load test signs off at 30k RPS with p99 under 40ms, yet customers and the upstream gateway report frequent multi-hundred-ms stalls and timeouts during peak. When you overlay the gateway's client-side latency histogram on your server-side one, the server side looks great, but the client side shows a fat tail and periodic gaps where almost no responses come back at all. The load generator is a closed-loop client: it sends the next request only after the previous response arrives. There was no code change; peak traffic just grew 20%. How do you triage this "phantom" p99 regression?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
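One hypothesis worth testing from the signals in the question is that the closed-loop load generator under-samples stalls (often called coordinated omission): while the server is frozen, the client is waiting and sends nothing, so a long stall contributes a single slow sample instead of one for every request that should have been sent. The sketch below is a toy, deterministic simulation of that effect, not a model of the actual service. The function names and all numbers (a 500ms stall every 10s, one request every 2ms, 1ms service time) are illustrative assumptions and do not come from the incident.

```python
def simulate(duration_ms=60_000, interval_ms=2, service_ms=1,
             stall_every_ms=10_000, stall_ms=500):
    """Toy single-server model (all parameters are hypothetical).

    The server freezes for stall_ms at the start of every stall_every_ms
    window; otherwise each request takes service_ms.
    Returns (closed_loop_latencies, open_loop_latencies) in ms.
    """
    def finish_time(start):
        # A request that begins during a stall waits for the stall to end.
        offset = start % stall_every_ms
        if offset < stall_ms:
            start += stall_ms - offset
        return start + service_ms

    # Closed loop: the next request is sent only after the previous response,
    # and latency is measured from the actual send time.
    closed = []
    t = 0
    while t < duration_ms:
        done = finish_time(t)
        closed.append(done - t)
        t = max(done, t + interval_ms)

    # Open loop: requests are due on a fixed schedule no matter how slow the
    # server is, and latency is measured from the intended send time.
    open_loop = []
    server_free = 0
    for intended in range(0, duration_ms, interval_ms):
        start = max(intended, server_free)  # queue behind earlier requests
        done = finish_time(start)
        server_free = done
        open_loop.append(done - intended)

    return closed, open_loop


def percentile(samples, pct):
    """Nearest-rank percentile using integer arithmetic."""
    ordered = sorted(samples)
    rank = -(-len(ordered) * pct // 100)  # ceil(len * pct / 100)
    return ordered[max(rank, 1) - 1]


closed, open_loop = simulate()
print("closed-loop p99 (ms):", percentile(closed, 99))
print("open-loop   p99 (ms):", percentile(open_loop, 99))
print("max latency seen by each (ms):", max(closed), max(open_loop))
```

In this toy model both clients see the same worst-case latency, but only the open-loop measurement lets the stall show up in p99, because it keeps counting the requests that were due while the server was frozen. If the real incident follows this pattern, re-running the load test with a constant-rate (open-loop) generator and measuring from intended send time should make the client-side fat tail reproducible.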
