Question
An inventory-allocation service that uses optimistic concurrency (version column + compare-and-set, retry on conflict) gets stuck during a flash sale. Dashboards: CPU is high (60-80%), the service is busy and threads are running — nothing is blocked or deadlocked — yet successful allocations per second drop almost to zero. The retry-counter metric explodes: the average transaction is retrying 40+ times before giving up. The DB shows tons of `version mismatch` conflicts on a few popular SKUs but no lock waits. Restarting doesn't help. Triage and explain why it isn't a deadlock.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.