Question
Your async image-processing pipeline consumes jobs from an SQS queue and writes thumbnails to a third-party storage/CDN API. At 13:00 the queue's ApproximateNumberOfMessagesVisible starts climbing steadily, from a few hundred to 250k over an hour, and the age of the oldest message (ApproximateAgeOfOldestMessage) is now 45 minutes against an SLA of 2 minutes. Consumer pods are healthy and CPU is moderate, NOT pegged. There is no poison message: every job eventually succeeds, just slowly. Per-job processing time has quietly tripled. Logs show the third-party storage API returning many HTTP 429 (rate limited) responses, and your client retries with backoff, so each job takes much longer than usual. A marketing campaign drove a 4x upload spike this morning. How do you triage and resolve this backlog?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
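
One way to ground the triage in real signals is to turn the queue-depth figures from the question into a rough net growth rate, then estimate how long a drain would take once consumers outpace arrivals. The sketch below is illustrative only: the function names, the 300-message starting depth (a stand-in for "a few hundred"), and the arrival and service rates in the usage lines are assumptions, not measurements from the incident.

```python
import math


def net_growth_per_second(start_depth: int, end_depth: int, elapsed_seconds: float) -> float:
    """Average net queue growth (arrivals minus completions) in messages per second."""
    return (end_depth - start_depth) / elapsed_seconds


def drain_seconds(backlog: int, arrival_rate: float, service_rate: float) -> float:
    """Seconds needed to empty the backlog.

    Returns infinity when the service rate does not exceed the arrival rate,
    because the queue can then never drain.
    """
    surplus = service_rate - arrival_rate
    if surplus <= 0:
        return math.inf
    return backlog / surplus


# Scenario figures: roughly 300 (assumed) to 250k visible messages in one hour.
growth = net_growth_per_second(300, 250_000, 3600)  # roughly 69 messages/s net

# Hypothetical rates: 100 msg/s arriving, 150 msg/s completing after mitigation.
eta = drain_seconds(250_000, arrival_rate=100.0, service_rate=150.0)  # 5000.0 s

# If throughput stays capped at or below the arrival rate, the backlog never clears.
stuck = drain_seconds(250_000, arrival_rate=100.0, service_rate=100.0)  # math.inf
```

The useful point is the `surplus` term: while the third-party API's rate limit caps throughput at or below the arrival rate, adding consumer pods does not shrink the estimate, which is consistent with CPU not being the bottleneck here.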