Code Room
On-callMedium
Question
Your async job system processes webhooks via SQS + a worker fleet that autoscales on CPU. A burst of inbound webhooks (a partner did a backfill) pushes the SQS `ApproximateNumberOfMessagesVisible` from a few hundred to 1.2M. Queue age climbs to 40 minutes and downstream partners start complaining about delayed callbacks, but the worker autoscaler barely adds capacity — worker CPU sits around 35% because each job spends most of its time waiting on an external HTTP call. How do you triage and clear the backlog?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.