Code Room
On-call · Hard
Question
Your multi-tenant SaaS runs many customers' workloads on a shared Kubernetes cluster. This afternoon several unrelated tenants simultaneously report slow API responses and a few 503s. Your own service's request rate is normal, CPU on its pods is fine, but node-level dashboards show one node's CPU and disk I/O pegged, and the affected pods all happen to be scheduled on or near that node. One tenant just kicked off a heavy export/backfill. How do you triage and mitigate, and what's the durable fix?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
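One way to "form hypotheses from real signals" is to rank the pods on the hot node by their share of its CPU and disk I/O before blaming the export. The sketch below is only an illustration under stated assumptions: the `PodUsage` shape, the sample numbers, and the 50% share threshold are all hypothetical, and in practice the figures would come from whatever per-pod metrics your cluster already exposes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PodUsage:
    """One pod's observed usage on the hot node (hypothetical shape)."""
    tenant: str
    pod: str
    cpu_cores: float            # CPU in use, in cores
    disk_mbps: float            # disk throughput, in MB/s
    cpu_limit: Optional[float]  # None means no CPU limit is set


def rank_noisy_pods(
    pods: List[PodUsage], share_threshold: float = 0.5
) -> List[Tuple[PodUsage, float, float]]:
    """Return pods whose share of node CPU or disk I/O meets the threshold.

    Each result is (pod, cpu_share, disk_share), sorted so the pod with the
    largest share of either resource comes first.
    """
    total_cpu = sum(p.cpu_cores for p in pods)
    total_disk = sum(p.disk_mbps for p in pods)
    suspects = []
    for p in pods:
        cpu_share = p.cpu_cores / total_cpu if total_cpu else 0.0
        disk_share = p.disk_mbps / total_disk if total_disk else 0.0
        if max(cpu_share, disk_share) >= share_threshold:
            suspects.append((p, cpu_share, disk_share))
    suspects.sort(key=lambda s: max(s[1], s[2]), reverse=True)
    return suspects


# Made-up samples for the pegged node: one export job and two API pods.
samples = [
    PodUsage("tenant-a", "export-job", 6.0, 400.0, None),
    PodUsage("tenant-b", "api-b", 0.5, 10.0, 1.0),
    PodUsage("tenant-c", "api-c", 0.4, 8.0, 1.0),
]

suspects = rank_noisy_pods(samples)
for pod, cpu_share, disk_share in suspects:
    unbounded = " (no CPU limit)" if pod.cpu_limit is None else ""
    print(f"{pod.tenant}/{pod.pod}: cpu {cpu_share:.0%}, disk {disk_share:.0%}{unbounded}")
```

If a single pod dominates and has no limit, that supports the noisy-neighbor hypothesis and points at the mitigation (pause, throttle, or relocate that workload). If no pod stands out, the export may be a coincidence and the node itself deserves a closer look.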