Code Room
On-call · Medium · oc-g156
Subject: Backlog buildup · Level: Mid–Senior · ~30 min · Common in Reliability & on-call interviews · Industries: Technology, Software development

Question

Your payment-webhook fan-out runs SQS → Lambda. At 18:50, `ApproximateNumberOfMessagesVisible` on the `webhook-dispatch` queue jumps from ~500 to 220k and keeps climbing; `ApproximateAgeOfOldestMessage` is now 9 minutes and rising. The Lambda's `ConcurrentExecutions` is pinned flat at 1,000, and `Throttles` is non-zero and growing. A marketing push went out at 18:45 and tripled inbound webhook volume. Downstream, the merchant-notification API (called by the Lambda) shows p99 latency up from 80ms to 1,100ms. Walk through triage and mitigation.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
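One way to separate root cause from symptom in this scenario is to check the concurrency arithmetic before reaching for more capacity. The sketch below applies Little's law to the numbers in the question. It assumes, as simplifications not stated in the question, that each invocation handles one message and that invocation duration is dominated by the downstream call. Using p99 as that duration is a deliberately pessimistic stand-in, and the function names and the arrival rate in the usage lines are hypothetical.

```python
import math


def max_throughput(concurrency: int, latency_s: float) -> float:
    """Little's law ceiling: messages/sec that `concurrency` workers can
    sustain when each message takes `latency_s` seconds end to end."""
    return concurrency / latency_s


def drain_time_s(backlog: int, service_rate: float, arrival_rate: float) -> float:
    """Seconds to clear `backlog` at the given rates; infinite when
    arrivals match or exceed what the consumers can process."""
    net = service_rate - arrival_rate
    if net <= 0:
        return math.inf
    return backlog / net


# Figures from the question: 1,000 concurrent executions,
# downstream latency 80 ms before and 1,100 ms after.
ceiling_before = max_throughput(1000, 0.080)
ceiling_after = max_throughput(1000, 1.100)
slowdown = ceiling_before / ceiling_after

# Hypothetical arrival rate, chosen only to illustrate the shape of the problem.
assumed_arrival_rate = 3000.0
eta = drain_time_s(220_000, ceiling_after, assumed_arrival_rate)
```

Under these assumptions the ceiling falls from roughly 12,500 msg/s to roughly 909 msg/s at the same concurrency, a drop of about 14x, while inbound volume only tripled. That points at downstream latency, rather than the traffic spike alone, as the likelier driver of the backlog, and suggests that raising Lambda concurrency without protecting the merchant-notification API could make the incident worse.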
