On-callHardoc-g263

Subject P99 regressionLevel Senior–Staff~35 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

A bursty, spiky-traffic API on an HPA that scales on average CPU keeps oscillating: it scales up during a burst, then the autoscaler scales it back down aggressively in the lull, and the next burst hits an under-provisioned fleet that must cold-start pods (each ~30s to ready) — so p99 spikes on every burst and the fleet visibly flaps up and down all day. Steady-state load would be fine with a stable replica count. The HPA was recently retuned to be 'more cost-efficient' with a low target and short stabilization window. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.