On-call · Hard · oc-g311

Subject: Quota exhaustion
Level: Senior–Staff
Time: ~30 min
Common in: Reliability & on-call interviews
Industries: Technology

Question

During a planned product launch your service tries to scale from 80 to 200 pods. The HPA requests the replicas, but new pods get stuck in Pending and the cluster-autoscaler logs show it adding nodes — yet only ~30 of the 120 new pods ever schedule. Latency and error rate climb because the fleet is stuck at ~110 effective replicas. Node group capacity is nowhere near its max. Digging in, the cluster events show `FailedScheduling: 0/x nodes available: insufficient IPs` on the pending pods and the VPC subnet's available-IP gauge reads near zero. The autoscaler keeps trying. How do you triage and mitigate the launch, and what's the durable fix?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
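
One way to turn the "real signals" into a hypothesis you can state out loud is to do the subnet arithmetic. The sketch below is illustrative only: the CIDR, the in-use count, and the number of new nodes are hypothetical values, not figures from the incident. It assumes one IP per pod, one primary IP per node, no warm-pool pre-allocation by the CNI, and an AWS-style reservation of 5 addresses per subnet (other providers reserve a different number).

```python
import ipaddress

# Assumption: the provider reserves 5 addresses per subnet (true on AWS;
# other clouds differ, so adjust for your environment).
RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str, reserved: int = RESERVED_PER_SUBNET) -> int:
    """Addresses in the subnet that can be assigned to nodes and pods."""
    return max(ipaddress.ip_network(cidr).num_addresses - reserved, 0)


def ip_headroom(cidr: str, in_use: int, current_pods: int,
                target_pods: int, new_nodes: int) -> int:
    """Free IPs left after the scale-up; a negative result means exhaustion.

    Simplifying assumptions: one IP per pod, one primary IP per new node.
    """
    free = usable_ips(cidr) - in_use
    needed = (target_pods - current_pods) + new_nodes
    return free - needed


# Hypothetical numbers: a /24 subnet with 215 addresses already taken,
# scaling from 80 to 200 pods and adding 6 nodes.
print(usable_ips("10.0.1.0/24"))                    # 251
print(ip_headroom("10.0.1.0/24", 215, 80, 200, 6))  # -90
```

A negative headroom like this would explain why node group limits are irrelevant: the autoscaler can add nodes, but the subnet cannot supply addresses for the pods on them. The same function can be rerun against a candidate larger or additional CIDR to sanity-check a durable fix before proposing it.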
