Code Room
System design · Hard
Question
Design a general-purpose telemetry collection pipeline (OpenTelemetry-style) that receives metrics, logs, and traces from 100k agents at a combined 4 GB/s, normalizes and enriches them, and fans them out to multiple backends: a metrics TSDB, a log store, a trace store, and a cold archive in object storage. One backend (the trace store) is regularly slow or briefly down. The pipeline must not lose data when a backend is degraded, and a slow backend must not stall ingestion for the others. Design the stages, the buffering, and the per-backend isolation.
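As a starting point for sizing, the figures in the question can be turned into rough numbers. The sketch below is only back-of-envelope arithmetic: the 100k agents and 4 GB/s come from the question, while the trace share of traffic and the outage length are hypothetical values chosen for illustration.

```python
# Back-of-envelope sizing from the figures stated in the question.
AGENTS = 100_000
INGEST_BYTES_PER_S = 4 * 10**9  # 4 GB/s, using decimal gigabytes

# Hypothetical assumptions, NOT given in the question:
TRACE_SHARE_PERCENT = 30   # share of ingest bytes that are traces
OUTAGE_SECONDS = 10 * 60   # how long the trace store stays unavailable

# Average load each agent contributes.
per_agent_bytes_per_s = INGEST_BYTES_PER_S // AGENTS

# Raw volume the cold archive would receive per day, before compression.
daily_bytes = INGEST_BYTES_PER_S * 86_400

# Bytes that must be buffered for the trace store during one outage.
trace_backlog_bytes = (
    INGEST_BYTES_PER_S * TRACE_SHARE_PERCENT // 100 * OUTAGE_SECONDS
)

print(f"per agent: {per_agent_bytes_per_s / 1e3:.0f} kB/s")
print(f"per day: {daily_bytes / 1e12:.1f} TB")
print(f"trace backlog after outage: {trace_backlog_bytes / 1e9:.0f} GB")
```

Under these assumptions a single ten-minute trace store outage produces a backlog in the hundreds of gigabytes, which suggests the trace buffer needs durable storage rather than collector memory alone.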
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
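One way to make the isolation requirement concrete is to give each backend its own buffer, so that a failing backend only grows its own backlog. The following is a minimal in-process sketch, not a production design: `BackendLane`, `fan_out`, `ROUTES`, and the backend names are hypothetical, and the `spill` deque merely stands in for a durable store such as a local disk segment or a log topic.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class BackendLane:
    """One isolated buffer per backend (hypothetical name)."""

    def __init__(self, name: str, mem_capacity: int) -> None:
        self.name = name
        self.mem_capacity = mem_capacity
        self.mem: Deque[bytes] = deque()
        # Stand-in for durable overflow storage (assumption: disk or a log).
        self.spill: Deque[bytes] = deque()

    def offer(self, record: bytes) -> None:
        """Accept a record without ever blocking the caller."""
        # Once spilling has started, keep appending there so order is kept.
        if self.spill or len(self.mem) >= self.mem_capacity:
            self.spill.append(record)
        else:
            self.mem.append(record)

    def drain(self, send: Callable[[bytes], bool]) -> int:
        """Send buffered records until the backend refuses; return the count."""
        sent = 0
        while self.mem:
            if not send(self.mem[0]):  # backend slow or down: keep the record
                break
            self.mem.popleft()         # drop only after a confirmed send
            sent += 1
            if self.spill:             # refill memory from the spill in order
                self.mem.append(self.spill.popleft())
        return sent

    def backlog(self) -> int:
        return len(self.mem) + len(self.spill)


# Each signal goes to its own store plus the cold archive.
ROUTES: Dict[str, List[str]] = {
    "metrics": ["tsdb", "archive"],
    "logs": ["log_store", "archive"],
    "traces": ["trace_store", "archive"],
}


def fan_out(signal: str, record: bytes, lanes: Dict[str, BackendLane]) -> None:
    """Copy one normalized record into the lane of every target backend."""
    for backend in ROUTES[signal]:
        lanes[backend].offer(record)


# Demo: the trace store is down while metrics and traces keep arriving.
lanes = {n: BackendLane(n, mem_capacity=2) for n in
         ("tsdb", "log_store", "trace_store", "archive")}
delivered: Dict[str, List[bytes]] = {n: [] for n in lanes}
health = {"trace_store": False}


def sender(name: str) -> Callable[[bytes], bool]:
    def send(record: bytes) -> bool:
        if not health.get(name, True):
            return False
        delivered[name].append(record)
        return True
    return send


for i in range(5):
    fan_out("metrics", f"m{i}".encode(), lanes)
    fan_out("traces", f"t{i}".encode(), lanes)
for name, lane in lanes.items():
    lane.drain(sender(name))
backlog_during_outage = {n: lane.backlog() for n, lane in lanes.items()}

# The trace store recovers and its lane catches up in arrival order.
health["trace_store"] = True
lanes["trace_store"].drain(sender("trace_store"))
```

In this run the TSDB and archive lanes drain fully during the outage while the trace lane holds all five trace records, then delivers them in order after recovery. Because a record is removed only after a confirmed send, the sketch gives at-least-once delivery, so a real trace store would need to tolerate duplicates.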