On-callHardoc-g029

Subject Tail latencyLevel Senior–Staff~40 minCommon in Distributed systems interviewsIndustries Technology, Software development

Question

Your search-results page (Java) issues a scatter-gather fan-out to 50 shard servers and waits for all 50 before responding. Each shard's individual p99 is a healthy 15ms, but the page's end-to-end p99 is 220ms and the p99.9 is over 1s — far worse than any single shard. There was no deploy. A node-replacement event this morning left a handful of shard replicas serving from cold caches and one replica on a host with a noisy neighbor. Explain the math, triage, and propose mitigations.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.