Code Room
On-callMedium
Question
A B2B SaaS app was fine for a year, then this week — after onboarding a large enterprise customer (tenant_id 4471 with 20M rows vs the typical 50k) — the dashboard-load endpoint times out, but only for that tenant. Other tenants are snappy. DB CPU and I/O climb noticeably whenever 4471 loads a dashboard. `EXPLAIN ANALYZE` of the dashboard query shows a sequential scan filtering `WHERE tenant_id = 4471 AND created_at > now() - interval '30 days'` over a 60M-row events table; there's an index on `tenant_id` alone but none on `(tenant_id, created_at)`. Triage and fix.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.