Question
During a planned product launch, your service tries to scale from 80 to 200 pods. The HPA requests the replicas, but new pods get stuck in Pending and the cluster-autoscaler logs show it adding nodes, yet only ~30 of the 120 new pods ever schedule. Latency and error rate climb because the fleet is stuck at ~110 effective replicas. The node group is nowhere near its maximum size. Digging in, the cluster events show `FailedScheduling: 0/x nodes available: insufficient IPs` on the pending pods, and the VPC subnet's available-IP gauge reads near zero. The autoscaler keeps trying. How do you triage and mitigate the launch, and what's the durable fix?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
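
One signal worth quantifying early in triage is how many pod IPs the subnet can still hand out. The sketch below is a rough, hypothetical calculation using Python's standard `ipaddress` module. The CIDR `10.0.0.0/24`, the in-use count of 221, and the helper names `usable_ips` and `pod_ip_shortfall` are illustrative assumptions, not values from the incident. It also assumes a CNI that assigns every pod an IP from the VPC subnet and a provider that reserves 5 addresses per subnet (the AWS figure; other clouds differ).

```python
import ipaddress

# Assumption: the cloud provider reserves 5 addresses in every subnet
# (true for AWS VPC; adjust for other providers).
RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str, reserved: int = RESERVED_PER_SUBNET) -> int:
    """Addresses in the subnet that can ever be assigned to nodes or pods."""
    net = ipaddress.ip_network(cidr)
    return max(net.num_addresses - reserved, 0)


def pod_ip_shortfall(cidr: str, ips_in_use: int, pods_wanted: int) -> int:
    """How many of the requested pods cannot get an IP from this subnet.

    Simplification: ignores the IPs that newly added nodes consume
    themselves, so the real shortfall is likely a little larger.
    """
    free = max(usable_ips(cidr) - ips_in_use, 0)
    return max(pods_wanted - free, 0)


# Hypothetical numbers chosen to resemble the scenario: 120 new pods requested.
total = usable_ips("10.0.0.0/24")
short = pod_ip_shortfall("10.0.0.0/24", ips_in_use=221, pods_wanted=120)
print(f"usable={total} shortfall={short}")
```

With these made-up numbers the subnet has 30 free addresses against 120 requested pods, which would match the pattern of roughly 30 pods scheduling while the rest stay Pending regardless of how many nodes the autoscaler adds. A real check should read the live free-IP count from the cloud provider rather than derive it.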