On-call | Hard | oc-g521

Subject: Capacity incidents | Level: Senior–Staff | ~30 min | Common in: Reliability & on-call interviews | Industries: Technology

Question

Your multi-tenant SaaS runs many customers' workloads on a shared Kubernetes cluster. This afternoon several unrelated tenants simultaneously report slow API responses and a few 503s. Your own service's request rate is normal, CPU on its pods is fine, but node-level dashboards show one node's CPU and disk I/O pegged, and the affected pods all happen to be scheduled on or near that node. One tenant just kicked off a heavy export/backfill. How do you triage and mitigate, and what's the durable fix?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
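One way to make "form hypotheses from real signals" concrete is to rank the workloads on the saturated node by resource use before deciding what to throttle or move. The sketch below is illustrative only: `PodSample`, `hot_node`, `top_consumers`, and every value in `SAMPLES` are hypothetical, standing in for whatever per-pod metrics your monitoring actually exposes.

```python
from dataclasses import dataclass


@dataclass
class PodSample:
    """One per-pod usage reading (hypothetical shape, not a real API)."""
    tenant: str
    pod: str
    node: str
    cpu_cores: float   # observed CPU usage in cores
    disk_mbps: float   # observed disk throughput in MB/s


def hot_node(samples):
    """Return the node whose pods use the most CPU in total."""
    totals = {}
    for s in samples:
        totals[s.node] = totals.get(s.node, 0.0) + s.cpu_cores
    return max(totals, key=totals.get)


def top_consumers(samples, node, n=3):
    """Return up to n pods on the node, heaviest CPU (then disk) first."""
    on_node = [s for s in samples if s.node == node]
    ranked = sorted(on_node, key=lambda s: (s.cpu_cores, s.disk_mbps), reverse=True)
    return ranked[:n]


# Made-up readings for illustration; real values would come from your metrics.
SAMPLES = [
    PodSample("tenant-a", "api-1", "node-1", 0.4, 5.0),
    PodSample("tenant-b", "api-2", "node-2", 0.5, 4.0),
    PodSample("tenant-c", "export-job", "node-2", 3.6, 180.0),
    PodSample("tenant-d", "api-3", "node-2", 0.3, 2.0),
]

if __name__ == "__main__":
    node = hot_node(SAMPLES)
    for s in top_consumers(SAMPLES, node):
        print(node, s.tenant, s.pod, s.cpu_cores, s.disk_mbps)
```

A ranking like this only suggests a suspect. Whether the heavy export is the root cause or just a coincident symptom still has to be confirmed, for example by checking that the slowdown started when the job did.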

