Code Room
System design · Hard
Question
Design a Prometheus-compatible metrics ingestion and query backend for a fleet of 50,000 hosts emitting 20M active time series, scraped every 15s. Steady-state ingest is ~1.3M samples/sec, with bursts to 4M samples/sec during deploys. Queries are mostly dashboard range reads over the last 6h (p99 < 500ms), plus a long tail of alerting rule evaluations that run every 30s. Raw-resolution data must be kept for 15 days and 5-minute rollups for 13 months. Walk through the write path, the storage layout, and how you keep query latency bounded as cardinality grows.
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
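One way to clarify scale before drawing components is to turn the numbers in the question into a back-of-envelope sizing. The sketch below is only that: the bytes-per-sample figures, the 30-day month, and the function name `sizing` are assumptions for illustration, and it ignores series churn, index overhead, and replication.

```python
# Back-of-envelope sizing from the numbers stated in the question.
ACTIVE_SERIES = 20_000_000
SCRAPE_INTERVAL_S = 15
BURST_SAMPLES_PER_S = 4_000_000
RAW_RETENTION_DAYS = 15
ROLLUP_INTERVAL_S = 5 * 60
DASHBOARD_WINDOW_S = 6 * 3600
SECONDS_PER_DAY = 86_400

# Assumptions (not given in the question):
ROLLUP_RETENTION_DAYS = 13 * 30   # treat a month as 30 days
BYTES_PER_RAW_SAMPLE = 1.5        # rough compressed size per raw sample
BYTES_PER_ROLLUP_SAMPLE = 8.0     # hypothetical; aggregates may compress worse


def sizing() -> dict:
    """Return rough ingest and storage figures, before replication."""
    steady_rate = ACTIVE_SERIES / SCRAPE_INTERVAL_S

    # Points stored per series over each retention window.
    raw_points = RAW_RETENTION_DAYS * SECONDS_PER_DAY // SCRAPE_INTERVAL_S
    rollup_points = ROLLUP_RETENTION_DAYS * SECONDS_PER_DAY // ROLLUP_INTERVAL_S

    raw_samples = ACTIVE_SERIES * raw_points
    rollup_samples = ACTIVE_SERIES * rollup_points

    return {
        "steady_samples_per_s": steady_rate,
        "burst_headroom": BURST_SAMPLES_PER_S / steady_rate,
        "raw_samples": raw_samples,
        "raw_tb": raw_samples * BYTES_PER_RAW_SAMPLE / 1e12,
        "rollup_samples": rollup_samples,
        "rollup_tb": rollup_samples * BYTES_PER_ROLLUP_SAMPLE / 1e12,
        # Raw points one series contributes to a 6h dashboard read.
        "points_per_series_6h": DASHBOARD_WINDOW_S // SCRAPE_INTERVAL_S,
    }


if __name__ == "__main__":
    for name, value in sizing().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the steady rate comes out at about 1.33M samples/sec, consistent with the question, and the burst is roughly 3x that. Note that the 13-month rollup tier holds more points per series than the 15-day raw tier, so the rollup encoding deserves as much attention as the raw one. A 6h dashboard read touches 1,440 raw points per series, which bounds how many series a single panel can fan out to within the 500ms budget.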