Code Room
System design · Hard
Question
Design a distributed tracing pipeline for a microservice platform of ~800 services producing 5M traces/min at peak, averaging 40 spans/trace (so ~200M spans/min at peak). The business wants to keep essentially all error traces and all traces slower than the p99 latency for their endpoint, but can afford to store only ~2% of the healthy traces. The storage budget allows ~30 days of hot retention. Design the ingest path, the sampling decision, and trace storage with retrieval by trace ID and by service + latency + error filters.
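To make the sampling requirement concrete, here is a minimal sketch of the keep/drop rule the question describes, applied once a trace is complete. It is an illustration and not a reference answer: the function name `keep_trace`, its parameters, and the choice to hash the trace ID (so that every collector reaches the same verdict for the same trace) are all assumptions.

```python
import hashlib


def keep_trace(
    trace_id: str,
    has_error: bool,
    duration_ms: float,
    endpoint_p99_ms: float,
    healthy_rate: float = 0.02,  # ~2% of healthy traces, per the question
) -> bool:
    """Hypothetical tail-sampling decision for one completed trace."""
    # Error traces and traces slower than the endpoint's p99 are always kept.
    if has_error or duration_ms > endpoint_p99_ms:
        return True
    # Healthy traces: map the trace ID to a stable value in [0, 1) and keep
    # the lowest `healthy_rate` fraction. Hashing instead of random() makes
    # the decision deterministic across processes and retries.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < healthy_rate
```

Because the rule needs the trace's error flag and total duration, it can only run after the trace's spans have been gathered in one place, which is one of the design pressures the question is probing.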
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
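One way to clarify scale is a quick back-of-the-envelope estimate. The sketch below uses the numbers from the question plus three values the question does not give, all labeled as assumptions: a 1% error rate, a 1% slow-trace rate (roughly what "slower than p99" implies, ignoring overlap with errors), and 500 bytes per stored span. Treat the output as an order-of-magnitude guide at peak load, not a sizing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    # Given in the question.
    traces_per_min: int = 5_000_000
    spans_per_trace: int = 40
    healthy_sample_rate: float = 0.02
    retention_days: int = 30
    # ASSUMPTIONS, not stated in the question.
    error_rate: float = 0.01      # fraction of traces containing an error
    slow_rate: float = 0.01       # fraction slower than the endpoint p99
    bytes_per_span: int = 500     # average stored size of one span


def estimate(w: Workload) -> dict:
    spans_per_min = w.traces_per_min * w.spans_per_trace
    # Errors and slow traces are always kept; overlap between the two is
    # ignored, so this slightly overestimates the kept fraction.
    always_kept = w.error_rate + w.slow_rate
    keep_fraction = always_kept + (1 - always_kept) * w.healthy_sample_rate
    ingest_bytes_per_sec = spans_per_min * w.bytes_per_span / 60
    stored_bytes_per_day = spans_per_min * keep_fraction * w.bytes_per_span * 60 * 24
    return {
        "spans_per_min": spans_per_min,
        "spans_per_sec": spans_per_min / 60,
        "keep_fraction": keep_fraction,
        "ingest_gb_per_sec": ingest_bytes_per_sec / 1e9,
        "stored_tb_per_day": stored_bytes_per_day / 1e12,
        "hot_storage_tb": stored_bytes_per_day * w.retention_days / 1e12,
    }


if __name__ == "__main__":
    for name, value in estimate(Workload()).items():
        print(f"{name}: {value:,.4f}")
```

Under these assumptions the pipeline keeps roughly 4% of traces, so the ingest tier has to absorb the full span stream while the storage tier only sees a small slice of it. Changing `bytes_per_span` or `error_rate` shifts the totals proportionally, which is worth stating out loud when naming trade-offs.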