On-callHardoc-g507

Subject Autoscaling failureLevel Senior–Staff~30 minCommon in Reliability & on-call interviewsIndustries Technology

Question

Your checkout service autoscales on a custom metric: a 5-minute rolling average of request latency exported to the HPA via the metrics adapter. During a steadily rising promotional ramp, the service is consistently a step behind — every few minutes p99 breaches SLA, the HPA then adds pods, latency recovers, the rolling average finally catches up and the HPA scales back down, and the cycle repeats so the fleet is always slightly under-provisioned during the ramp. The HPA event log shows steady scale-up/scale-down churn; CPU and memory are moderate; demand is monotonically increasing all afternoon. How do you triage this and stop the chronic under-provisioning?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.