Question
You receive payment webhooks from a vendor to mark orders paid. At 14:30 a fraction of orders sit in 'pending payment' though customers were charged. The vendor's status page is green and they confirm they're sending. Dashboards: your webhook endpoint is returning 200s and inbound webhook volume looks roughly normal — but your endpoint's p99 ACK latency crept from 80ms to 6s starting 14:25, and the vendor's delivery dashboard (shared) shows it backing off and DELAYING redeliveries because your slow 200s look like near-failures. No vendor outage; you deployed a webhook-handler change at 14:20. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.