Code Room
System design · Medium
Question
Design an SLO monitoring and error-budget system for a payments API with two SLOs, 99.95% availability and 'p99 latency < 300ms', measured over a rolling 28-day window. The platform serves 40k req/s. You need to compute compliance, show the remaining error budget, and page on-call only when the budget is burning fast enough to matter, not on every blip. Design the measurement, the budget computation, and the alerting policy.
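To make the numbers in the question concrete, here is a minimal sketch of the availability budget arithmetic. It assumes a constant 40k req/s across the whole window and a request-based SLO (good requests divided by total requests). The function names `error_budget` and `budget_remaining` are hypothetical and chosen for illustration only.

```python
from decimal import Decimal

SLO_TARGET = Decimal("0.9995")   # 99.95% availability, from the question
WINDOW_DAYS = 28                 # rolling window, from the question
REQUESTS_PER_SECOND = 40_000     # assumed constant for this estimate


def error_budget(target, window_days, rps):
    """Return (allowed_bad_requests, equivalent_full_outage_seconds)."""
    window_seconds = window_days * 24 * 60 * 60
    total_requests = window_seconds * rps
    budget_fraction = Decimal(1) - target  # share of requests allowed to fail
    return total_requests * budget_fraction, window_seconds * budget_fraction


def budget_remaining(bad_so_far, allowed_bad):
    """Fraction of the budget still unspent; negative once the SLO is violated."""
    return 1 - Decimal(bad_so_far) / allowed_bad


allowed_bad, outage_seconds = error_budget(
    SLO_TARGET, WINDOW_DAYS, REQUESTS_PER_SECOND
)
```

Under these assumptions the window holds about 96.8 billion requests, so the budget is 48,384,000 failed requests, equivalent to roughly 20 minutes (1,209.6 seconds) of total outage. One possible reading of the latency SLO treats it the same way: target 0.99, with any request slower than 300 ms counted as bad.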
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
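For the alerting policy, one commonly cited approach is multi-window burn-rate alerting: page only when both a long and a short lookback show the budget burning faster than a threshold. The sketch below is one way to express that, not a prescribed answer. The rule values (2% of the budget in 1 hour, 5% in 6 hours) are assumed starting points to tune, and `BurnRule`, `burn_rate`, and `should_page` are hypothetical names.

```python
from dataclasses import dataclass

BUDGET_FRACTION = 0.0005   # 1 - 0.9995, from the availability SLO
WINDOW_HOURS = 28 * 24     # rolling 28-day window, from the question


@dataclass(frozen=True)
class BurnRule:
    long_hours: float    # long lookback: shows the burn is sustained
    short_hours: float   # short lookback: shows the burn is still happening
    budget_spent: float  # share of the 28-day budget tolerated within long_hours

    @property
    def threshold(self) -> float:
        # Burn rate that would spend `budget_spent` of the budget in `long_hours`.
        return self.budget_spent * WINDOW_HOURS / self.long_hours


# Illustrative paging rules (assumed values, not taken from the question).
PAGE_RULES = (
    BurnRule(long_hours=1, short_hours=5 / 60, budget_spent=0.02),
    BurnRule(long_hours=6, short_hours=0.5, budget_spent=0.05),
)


def burn_rate(bad: int, total: int) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    if total == 0:
        return 0.0
    return (bad / total) / BUDGET_FRACTION


def should_page(counts_by_window: dict) -> bool:
    """counts_by_window maps lookback hours -> (bad, total) request counts."""
    for rule in PAGE_RULES:
        long_burn = burn_rate(*counts_by_window[rule.long_hours])
        short_burn = burn_rate(*counts_by_window[rule.short_hours])
        # Both windows must exceed the threshold, which filters out short blips
        # and lets the alert clear soon after the problem stops.
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return True
    return False
```

With a 28-day window these two rules work out to burn-rate thresholds of about 13.44 and 5.6. A five-minute spike at a 1% error rate does not page, because the one-hour burn rate stays well under the threshold, while the same error rate sustained for an hour does. Slower burns that never trip either rule would typically go to a ticket queue rather than a page, which is a trade-off worth stating in the interview.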
Run or narrate your approach, then ask the coach.