Code Room
On-callHardoc-g309
Subject Autoscaling failureLevel Senior–Staff~30 minCommon in Reliability & on-call interviewsIndustries Technology

Question

Your recommendations service runs on a Kubernetes HPA targeting 65% CPU, min 8 / max 40 pods. Since this morning the replica count oscillates wildly — scaling 8→36→8→34 every few minutes — and p99 latency has a sawtooth pattern that tracks the swings. Each scale-down evicts warm pods, and the surviving pods briefly spike to 95% CPU before new ones become ready (your readiness probe gates on a 40s warm-cache load). Dashboards show average CPU hovering right at the 65% target and the HPA event log is full of alternating SuccessfulRescale up/down events. Traffic is flat versus last week. Walk through your triage and how you'd stop the flapping.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Diagram & narrate the incident
Loading whiteboard…
Run or narrate your approach, then ask the coach.