Code Room
System designHard
Question
Design how a metrics platform records and serves latency percentiles for 100,000 service endpoints. Engineers want p50/p90/p99/p99.9 for any endpoint, and crucially want to aggregate percentiles across arbitrary groupings (all endpoints in a region, a whole service) computed on the fly. Averaging stored per-endpoint percentiles is mathematically wrong, but storing every raw latency is too expensive. Design the data representation, the aggregation, and the accuracy/cost trade-off.
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.