Code Room
System design | Medium | sd-g158
Subject: Alerting systems | Level: Mid–Senior | ~35 min | Common in: Reliability & on-call interviews | Industries: Technology, Software development

Question

Design the alerting and notification routing layer that sits behind a metrics platform serving 2,000 engineering teams. Rule evaluators across multiple regions fire ~30k alert state-changes/min during a large outage (normally ~200/min). You must deduplicate identical alerts from redundant evaluators, group related alerts into a single notification, route each notification to the right on-call engineer via escalation policies, and never drop a page or page twice for the same alert. Design the state management, dedup, grouping/throttling, and delivery guarantees.
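A quick back-of-envelope pass over the figures in the question helps frame the design. The sketch below uses only the numbers stated above; the per-team figure assumes an even spread across teams, which a real outage will not have.

```python
# Back-of-envelope sizing from the figures stated in the question.
PEAK_PER_MIN = 30_000     # alert state-changes/min during a large outage
NORMAL_PER_MIN = 200      # alert state-changes/min in steady state
TEAMS = 2_000

peak_per_sec = PEAK_PER_MIN / 60               # sustained ingest rate at peak
normal_per_sec = NORMAL_PER_MIN / 60           # steady-state ingest rate
burst_factor = PEAK_PER_MIN / NORMAL_PER_MIN   # how far peak exceeds normal

# Assumption: even spread across teams. Real outages skew heavily toward
# a few teams, so treat this as a floor rather than an estimate.
per_team_peak_per_min = PEAK_PER_MIN / TEAMS

print(f"peak: {peak_per_sec:.0f}/s, normal: {normal_per_sec:.2f}/s")
print(f"burst factor: {burst_factor:.0f}x")
print(f"per-team at peak (even spread): {per_team_peak_per_min:.0f}/min")
```

The raw rate (about 500 state-changes per second at peak) is modest for most queues and stores; the 150x burst over normal is the harder part, since it is what grouping and throttling must absorb before anything reaches a human.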

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
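As one illustration of "going deep" on dedup and grouping, here is a minimal in-memory sketch. Everything in it is a hypothetical assumption made for illustration, not a reference answer: the `StateChange` fields, the `fingerprint` scheme, the `Router` class, and the choice to group by `team` and `service` labels. A real design would keep this state in a durable, replicated store and would add escalation timers and delivery acknowledgements.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StateChange:
    rule_id: str
    labels: tuple      # (key, value) pairs, in any order
    state: str         # "firing" or "resolved"
    evaluator: str     # which redundant evaluator reported it


def fingerprint(rule_id, labels):
    """Stable identity for an alert, independent of evaluator and label order."""
    raw = rule_id + "|" + "|".join(f"{k}={v}" for k, v in sorted(labels))
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


class Router:
    def __init__(self, group_by=("team", "service")):
        self.group_by = group_by
        self.state = {}    # fingerprint -> last accepted state
        self.groups = {}   # group key -> fingerprints awaiting notification

    def ingest(self, change):
        """Return True if the change is new, False if it is a duplicate."""
        fp = fingerprint(change.rule_id, change.labels)
        if self.state.get(fp) == change.state:
            return False   # another evaluator already reported this state
        self.state[fp] = change.state
        label_map = dict(change.labels)
        key = tuple(label_map.get(k, "") for k in self.group_by)
        self.groups.setdefault(key, set()).add(fp)
        return True

    def flush(self):
        """Emit one notification per group, then clear the pending groups."""
        out = []
        for key, fps in sorted(self.groups.items()):
            # Derived from group contents, so a retried send of the same
            # batch carries the same key and the receiver can discard it.
            raw = "|".join(key) + "|" + "|".join(sorted(fps))
            idem = hashlib.sha256(raw.encode()).hexdigest()[:16]
            out.append({"group": key, "alerts": len(fps), "idempotency_key": idem})
        self.groups.clear()
        return out


router = Router()
a = StateChange("cpu_high", (("service", "api"), ("team", "payments")), "firing", "us-east")
b = StateChange("cpu_high", (("team", "payments"), ("service", "api")), "firing", "eu-west")
c = StateChange("latency_high", (("service", "api"), ("team", "payments")), "firing", "us-east")

accepted = [router.ingest(x) for x in (a, b, c)]
notifications = router.flush()
```

Here `b` is dropped as a duplicate of `a` (same rule and labels, different evaluator), and `a` and `c` collapse into a single notification for the `payments`/`api` group. The trade-off worth naming in an interview is that `flush` on a timer trades notification latency for fewer pages, and that the idempotency key only prevents double-paging if the delivery side honors it.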
