Question
Design a system that, when a top-line metric (e.g. checkout success rate) drops, automatically localizes which dimension is responsible — a specific region, app version, device type, payment provider, or their combination — across a metric with 200 dimensions and millions of dimension-value combinations. Engineers currently slice dashboards by hand for 30 minutes during incidents. The system should surface the responsible slice within ~1 minute of the drop. Design the detection trigger, the dimensional search, and how you avoid spurious explanations.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.