Question
Your service's dashboards report a healthy p99 of 25ms and the load test signs off at 30k RPS with p99 under 40ms, yet customers and the upstream gateway report frequent multi-hundred-ms stalls and timeouts during peak. When you overlay the gateway's client-side latency histogram on your server-side one, the server side looks great but the client side shows a fat tail and periodic gaps where almost no responses come back at all. The load generator is a closed-loop client that sends the next request only after the previous response. There was no code change — peak traffic just grew 20%. How do you triage this 'phantom' p99 regression?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.