Question
Your checkout service autoscales on a custom metric: a 5-minute rolling average of request latency exported to the HPA via the metrics adapter. During a steadily rising promotional ramp, the service is consistently a step behind — every few minutes p99 breaches SLA, the HPA then adds pods, latency recovers, the rolling average finally catches up and the HPA scales back down, and the cycle repeats so the fleet is always slightly under-provisioned during the ramp. The HPA event log shows steady scale-up/scale-down churn; CPU and memory are moderate; demand is monotonically increasing all afternoon. How do you triage this and stop the chronic under-provisioning?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.