On-callHardoc-g510

Subject Autoscaling failureLevel Senior–Staff~30 minCommon in Reliability & on-call interviewsIndustries Technology

Question

An internal API runs on Kubernetes with both a HorizontalPodAutoscaler (target 60% CPU, 4-30 pods) and a VerticalPodAutoscaler in `Auto` mode that resizes pod CPU/memory requests. Since both were enabled last week, the deployment thrashes: pods are evicted and recreated every few minutes, replica count swings, and p99 has a jagged saw-tooth. `kubectl get events` is full of VPA evictions and HPA scale events interleaved. Each control loop on its own looked fine in staging. How do you triage and stabilize this?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.