On-call · Medium · oc-g560

Subject: On call
Level: Mid–Senior
Time: ~35 min
Common in: Reliability & on-call · Distributed systems · Algorithms & data structures interviews
Industries: Technology

Question

Your async image-processing pipeline consumes jobs from an SQS queue and writes thumbnails to a third-party storage/CDN API. At 13:00 the queue's ApproximateNumberOfMessagesVisible starts climbing steadily — from a few hundred to 250k over an hour — and the oldest-message-age is now 45 minutes (your SLA is 2 minutes). Consumer pods are healthy and CPU is moderate, NOT pegged. There's no poison message — every job eventually succeeds, just slowly. Per-job processing time has quietly tripled. Logs show the third-party storage API is returning a lot of HTTP 429 (rate limited) and your client is retrying with backoff, so each job takes much longer than usual. A marketing campaign drove a 4x upload spike this morning. How do you triage and resolve this backlog?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
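
One way to ground the "real signals" step is to turn the queue numbers from the question into rates. The sketch below is a minimal illustration, not a prescribed method: the function names are hypothetical, the starting depth of 300 stands in for "a few hundred", and the arrival and throughput figures in the usage lines are invented purely to show the arithmetic.

```python
import math


def net_growth_per_sec(start_depth: int, end_depth: int, seconds: float) -> float:
    """Average rate at which the backlog grew: arrivals minus completions."""
    return (end_depth - start_depth) / seconds


def drain_time_sec(backlog: int, arrival_per_sec: float, throughput_per_sec: float) -> float:
    """Time to clear the backlog once throughput exceeds arrivals.

    Returns infinity when there is no surplus capacity, because the
    queue never drains in that case.
    """
    surplus = throughput_per_sec - arrival_per_sec
    if surplus <= 0:
        return math.inf
    return backlog / surplus


# From the question: a few hundred (assumed 300) to 250k messages in one hour.
growth = net_growth_per_sec(300, 250_000, 3600)
print(f"net growth: {growth:.1f} msg/s")

# Hypothetical post-mitigation rates, chosen only for illustration.
eta = drain_time_sec(250_000, arrival_per_sec=100, throughput_per_sec=150)
print(f"estimated drain time: {eta / 60:.0f} min")
```

The useful point for the interview is the shape of the calculation: the backlog only shrinks by the surplus of throughput over arrivals, so any mitigation has to be judged by how much surplus it creates, and the third-party rate limit caps that surplus no matter how many consumers are added.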
