Code Room
On-callHard
Question
An internal API runs on Kubernetes with both a HorizontalPodAutoscaler (target 60% CPU, 4-30 pods) and a VerticalPodAutoscaler in `Auto` mode that resizes pod CPU/memory requests. Since both were enabled last week, the deployment thrashes: pods are evicted and recreated every few minutes, replica count swings, and p99 has a jagged saw-tooth. `kubectl get events` is full of VPA evictions and HPA scale events interleaved. Each control loop on its own looked fine in staging. How do you triage and stabilize this?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.