Code Room
On-callHard
Question
Your search-results page (Java) issues a scatter-gather fan-out to 50 shard servers and waits for all 50 before responding. Each shard's individual p99 is a healthy 15ms, but the page's end-to-end p99 is 220ms and the p99.9 is over 1s — far worse than any single shard. There was no deploy. A node-replacement event this morning left a handful of shard replicas serving from cold caches and one replica on a host with a noisy neighbor. Explain the math, triage, and propose mitigations.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.