Question
At 14:05 the checkout service stops serving: P99 latency jumps from 80ms to a flat 30s timeout wall, and the request-in-flight gauge climbs monotonically to the connection cap, then plateaus. No errors in the app logs at first — just nothing completing. The DB shows a spike in `pg_stat_activity` rows stuck in state `idle in transaction` and a handful of `ProcessUtility` waits. A deploy went out at 14:00 that added a 'reserve inventory then debit wallet' flow alongside the existing 'debit wallet then reserve inventory' coupon path. CPU on both app and DB is near idle. Walk me through how you triage and stabilize this.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.