Question
Your async image-processing fleet autoscales on CPU (target 70%), pulling jobs from a queue. Since a partner integration went live, the queue backlog has grown to 400k messages and end-to-end processing SLA is breached, yet the fleet is sitting at 12 of a possible 60 workers and CPU is only ~45%. The HPA is not scaling out. Workers spend most of their time blocked on a slow downstream thumbnail-storage API (high wait, low CPU). Queue age dashboards show oldest-message age climbing linearly. Recent context: nothing deployed to the workers; the partner simply tripled inbound volume. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.