Question
Design compute isolation for a multi-tenant query/analytics platform where 30,000 tenants share a pool of worker nodes and one tenant's runaway workload (a Cartesian join scanning terabytes, or 10,000 concurrent requests) must NOT degrade latency for everyone else, but you also can't afford a dedicated cluster per tenant. This is performance isolation, not data isolation. Discuss the scheduling/quota model, how you detect and contain a noisy neighbor in real time, the trade-off between hard isolation (separate resources) and soft isolation (shared with fairness), and how you keep small tenants responsive.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.