On-callMediumoc-g098

Subject Autoscaling failureLevel Mid–Senior~35 minCommon in Algorithms & data structures interviewsIndustries Technology

Question

Your async job system processes webhooks via SQS + a worker fleet that autoscales on CPU. A burst of inbound webhooks (a partner did a backfill) pushes the SQS `ApproximateNumberOfMessagesVisible` from a few hundred to 1.2M. Queue age climbs to 40 minutes and downstream partners start complaining about delayed callbacks, but the worker autoscaler barely adds capacity — worker CPU sits around 35% because each job spends most of its time waiting on an external HTTP call. How do you triage and clear the backlog?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.