On-call · Hard · oc-g444

Subject: Tail latency
Level: Senior–Staff
~40 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

Your team's nightly closed-loop load test signs off the service at 50k RPS with a reported p99 of 22ms and p99.9 of 35ms — green every night. Yet production at the same RPS shows p99 of 180ms and a steady trickle of client-side timeouts, and customers complain about freezes. The load-test harness is a single-threaded-per-connection closed-loop tool: it sends a request, waits for the response, then sends the next. The service does have occasional ~400ms stalls (a periodic background compaction). Production clients are open-loop (independent arrivals). How do you reconcile the numbers and decide whether there's a real problem?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
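For this question, one plausible reading of the "real signals" is that the closed-loop harness is subject to coordinated omission: while a connection waits out a ~400ms stall it sends nothing, so each stall contributes one slow sample per connection instead of the hundreds an open-loop arrival stream would record. The sketch below is a toy simulation of that effect, not a model of the actual service. The 10-second stall period, the 1ms service time, the 60-second run, and the function names are all assumptions chosen for illustration, and queueing after the stall is ignored (which would make the open-loop numbers worse, not better).

```python
# Toy model: all times are integer milliseconds.
PERIOD_MS = 10_000    # assumed: one compaction stall every 10 s
STALL_MS = 400        # stall length taken from the question
SERVICE_MS = 1        # assumed service time outside a stall
DURATION_MS = 60_000  # assumed length of the simulated run


def completion_time(start_ms: int) -> int:
    """Return when a request sent at start_ms completes."""
    offset = start_ms % PERIOD_MS
    if offset < STALL_MS:
        # The stall occupies the first STALL_MS of every period;
        # work resumes only once it ends.
        start_ms += STALL_MS - offset
    return start_ms + SERVICE_MS


def closed_loop_latencies() -> list[int]:
    """One connection: send, wait for the response, then send the next."""
    latencies, now = [], 0
    while now < DURATION_MS:
        done = completion_time(now)
        latencies.append(done - now)
        # The next send is pushed past the stall, so the requests that
        # "should" have been sent during it are never measured.
        now = done
    return latencies


def open_loop_latencies(interval_ms: int = 1) -> list[int]:
    """Arrivals on a fixed schedule, independent of earlier responses."""
    return [completion_time(t) - t for t in range(0, DURATION_MS, interval_ms)]


def percentile(values: list[int], per_mille: int) -> int:
    """Nearest-rank percentile; per_mille=990 is p99, 999 is p99.9."""
    ordered = sorted(values)
    rank = -(-per_mille * len(ordered) // 1000)  # ceiling division
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    for name, samples in [("closed", closed_loop_latencies()),
                          ("open", open_loop_latencies())]:
        print(name, "p99 =", percentile(samples, 990),
              "p99.9 =", percentile(samples, 999), "max =", max(samples))
```

Under these assumed parameters the closed-loop run records only six stalled samples out of 57,600, so its p99 and p99.9 both stay at 1ms even though the maximum is 401ms. The open-loop schedule puts 4% of arrivals inside a stall and reports a p99 in the hundreds of milliseconds. The absolute numbers will not match the question's 22ms and 180ms; the point is the direction and size of the gap, which suggests the production numbers describe a real problem and the nightly sign-off is measuring the wrong thing.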
