Code Room
System design | Hard | sd-g155
Subject: Metrics systems | Level: Senior–Staff | ~45 min | Common in: Distributed systems interviews | Industries: Technology

Question

Design a Prometheus-compatible metrics ingestion and query backend for a fleet of 50,000 hosts emitting 20M active time series, scraped every 15s. Steady-state ingest is ~1.3M samples/sec, with bursts to 4M samples/sec during deploys. Queries are mostly dashboard range reads over the last 6h (p99 < 500ms), plus a long tail of alerting rule evaluations that run every 30s. Raw-resolution data must be kept for 15 days and 5-minute rollups for 13 months. Walk the write path, the storage layout, and how you keep query latency bounded as cardinality grows.

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
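One way to open the "clarify scale" step is a quick back-of-envelope check of the numbers in the prompt. The sketch below is only an illustration. The series count, scrape interval, burst rate, and retention windows come from the question; the 1.5 bytes per compressed sample, the 395-day approximation of 13 months, and the single stored value per rollup point are assumptions made here. The helper names are hypothetical. Series churn, replication, and index overhead are ignored.

```python
# Back-of-envelope sizing from the figures stated in the question.
ACTIVE_SERIES = 20_000_000
SCRAPE_INTERVAL_S = 15
BURST_SAMPLES_PER_S = 4_000_000
RAW_RETENTION_DAYS = 15
ROLLUP_INTERVAL_S = 300          # 5-minute rollups
QUERY_WINDOW_S = 6 * 3600        # typical dashboard range read

# Assumptions, NOT from the question:
ROLLUP_RETENTION_DAYS = 395      # 13 months approximated as 395 days
BYTES_PER_SAMPLE = 1.5           # rough compressed size per sample

SECONDS_PER_DAY = 86_400


def steady_ingest_rate(series: int, interval_s: int) -> float:
    """Samples per second if every series is scraped once per interval."""
    return series / interval_s


def retained_samples(series: int, interval_s: int, days: int) -> int:
    """Total samples held when each series keeps one point per interval."""
    return series * (days * SECONDS_PER_DAY // interval_s)


def storage_tb(samples: int, bytes_per_sample: float) -> float:
    """Storage in decimal terabytes, before replication and index overhead."""
    return samples * bytes_per_sample / 1e12


steady = steady_ingest_rate(ACTIVE_SERIES, SCRAPE_INTERVAL_S)
burst_factor = BURST_SAMPLES_PER_S / steady
raw = retained_samples(ACTIVE_SERIES, SCRAPE_INTERVAL_S, RAW_RETENTION_DAYS)
rollup = retained_samples(ACTIVE_SERIES, ROLLUP_INTERVAL_S, ROLLUP_RETENTION_DAYS)
points_per_series_6h = QUERY_WINDOW_S // SCRAPE_INTERVAL_S

if __name__ == "__main__":
    print(f"steady ingest: {steady:,.0f} samples/s")
    print(f"burst factor: {burst_factor:.1f}x")
    print(f"raw samples retained: {raw:,}")
    print(f"raw storage: {storage_tb(raw, BYTES_PER_SAMPLE):.2f} TB")
    print(f"rollup points retained (one aggregate): {rollup:,}")
    print(f"raw points per series in a 6h read: {points_per_series_6h:,}")
```

Under these assumptions the steady rate is about 1.33M samples/s (matching the ~1.3M in the prompt), the burst is roughly 3x steady state, and the raw tier holds about 1.7 trillion samples, or roughly 2.6 TB at 1.5 bytes per sample. A 6h dashboard read touches 1,440 raw points per series, so the number of series a query selects, rather than the time range, is what dominates read cost as cardinality grows.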
