On-callHardoc-g310

Subject Autoscaling failureLevel Senior–Staff~30 minCommon in Algorithms & data structures interviewsIndustries Technology

Question

Your async image-processing fleet autoscales on CPU (target 70%), pulling jobs from a queue. Since a partner integration went live, the queue backlog has grown to 400k messages and end-to-end processing SLA is breached, yet the fleet is sitting at 12 of a possible 60 workers and CPU is only ~45%. The HPA is not scaling out. Workers spend most of their time blocked on a slow downstream thumbnail-storage API (high wait, low CPU). Queue age dashboards show oldest-message age climbing linearly. Recent context: nothing deployed to the workers; the partner simply tripled inbound volume. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.