Code Room
System design · Medium
Question
Design the alerting and notification routing layer that sits behind a metrics platform for 2,000 engineering teams. Rule evaluators across multiple regions fire ~30k alert state-changes/min during a large outage (normally ~200/min). You must deduplicate identical alerts from redundant evaluators, group related alerts into a single notification, route to the right on-call via escalation policies, and never drop or double-page. Design state management, dedup, grouping/throttling, and delivery guarantees.
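For a sense of scale, the figures in the question can be turned into per-second rates. The sketch below uses only the numbers given above; the even spread across teams is an assumption made purely for intuition, since a real outage would concentrate alerts on a subset of teams.

```python
# Figures taken directly from the question.
PEAK_PER_MIN = 30_000    # alert state-changes/min during a large outage
NORMAL_PER_MIN = 200     # alert state-changes/min in steady state
TEAMS = 2_000

# Derived values: plain arithmetic, no further assumptions.
peak_per_sec = PEAK_PER_MIN / 60                # 500.0 events/s at peak
normal_per_sec = NORMAL_PER_MIN / 60            # about 3.3 events/s normally
burst_factor = PEAK_PER_MIN / NORMAL_PER_MIN    # peak is 150x the normal rate

# Assumption for intuition only: load spread evenly over every team.
peak_per_team_per_min = PEAK_PER_MIN / TEAMS    # 15.0 events/min per team

print(f"peak: {peak_per_sec:.0f}/s, normal: {normal_per_sec:.1f}/s, "
      f"burst: {burst_factor:.0f}x, per team at peak: {peak_per_team_per_min:.0f}/min")
```

The raw event rate is modest; the harder constraint is the 150x burst combined with the requirement never to drop or double-page.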
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
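As one illustration of going deep on a hard part, here is a minimal sketch of deduplicating state-changes from redundant evaluators. It is not a reference answer. The names `fingerprint` and `Deduper` are hypothetical, and the sketch assumes that evaluator identity and region are excluded from the labels, so that redundant evaluators produce the same key for the same alert.

```python
import hashlib
import json


def fingerprint(rule_id: str, labels: dict[str, str]) -> str:
    """Stable key for an alert: same rule and labels give the same key.

    Assumption: labels exclude evaluator/region identity, so redundant
    evaluators reporting the same alert produce identical fingerprints.
    """
    canonical = json.dumps({"rule": rule_id, "labels": labels}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Deduper:
    """Accepts a state-change only if it differs from the last stored state.

    In-memory for illustration; a real design would need a durable store
    with an atomic compare-and-set so concurrent evaluators cannot both win.
    """

    def __init__(self) -> None:
        self._last_state: dict[str, str] = {}

    def accept(self, rule_id: str, labels: dict[str, str], state: str) -> bool:
        key = fingerprint(rule_id, labels)
        if self._last_state.get(key) == state:
            return False  # duplicate report of a state already recorded
        self._last_state[key] = state
        return True


dedup = Deduper()
labels_a = {"service": "checkout", "severity": "page"}
labels_b = {"severity": "page", "service": "checkout"}  # same labels, other order

first = dedup.accept("high_error_rate", labels_a, "firing")     # evaluator 1
second = dedup.accept("high_error_rate", labels_b, "firing")    # evaluator 2
resolved = dedup.accept("high_error_rate", labels_a, "resolved")
```

Here the first report is accepted, the redundant second report is rejected, and the later transition to resolved is accepted again. The same fingerprint could plausibly serve as part of an idempotency key on the delivery side, though that is a design choice the answer would need to justify.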
Run or narrate your approach, then ask the coach.