Code Room
On-callMedium
Question
You receive payment webhooks from a third-party provider. Your handler records the payment and ships the order. The provider retries any webhook it doesn't get a 2xx for within 10s. Starting at 12:00, a slow downstream made your handler take 12–15s; afterward, ops finds duplicate shipments and duplicate ledger rows for a subset of payments. Dashboards: handler error rate flat (it returns 200, just slowly), order/shipment volume elevated. How do you triage, stop the duplicate side-effects, and reconcile?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.