Code Room
On-call · Medium
Question
During a flash on-sale at 10:00, your ticketing service (Java + Hibernate on Postgres) starts throwing intermittent 'deadlock detected' and 'could not serialize access' errors, and the order success rate drops to 70%. The hottest SKU is a single popular event row: every purchase decrements its available_count column with SELECT ... FOR UPDATE followed by an UPDATE. Throughput is fine for unpopular events, and CPU and connection counts are healthy. How do you triage and mitigate during the live event, and how do you fix it afterward?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
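As one illustration of a "stop the bleeding" step, here is a minimal sketch of a bounded retry with jittered backoff around the purchase transaction. It is one possible mitigation under stated assumptions, not the expected answer. The class name HotRowRetry, the helper names, and the events table with its id column are hypothetical; only available_count comes from the question. The SQLSTATE values 40001 (serialization_failure) and 40P01 (deadlock_detected) are the PostgreSQL codes behind the two error messages quoted in the question.

```java
import java.sql.SQLException;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

final class HotRowRetry {
    // PostgreSQL SQLSTATEs: 40001 = serialization_failure, 40P01 = deadlock_detected.
    static final Set<String> RETRYABLE_SQLSTATES = Set.of("40001", "40P01");

    // Hypothetical table and key column. A single conditional UPDATE is one
    // commonly suggested replacement for SELECT ... FOR UPDATE followed by
    // UPDATE; zero rows updated would mean the event is sold out.
    static final String DECREMENT_SQL =
        "UPDATE events SET available_count = available_count - 1 "
            + "WHERE id = ? AND available_count > 0";

    @FunctionalInterface
    interface SqlWork<T> {
        T run() throws SQLException;
    }

    static boolean isRetryable(SQLException e) {
        String state = e.getSQLState();
        return state != null && RETRYABLE_SQLSTATES.contains(state);
    }

    // Runs the whole unit of work again (it must open a fresh transaction each
    // time) when the failure is a deadlock or serialization conflict.
    static <T> T withRetry(int maxAttempts, long baseDelayMillis, SqlWork<T> work)
            throws SQLException, InterruptedException {
        if (maxAttempts < 1 || baseDelayMillis < 0) {
            throw new IllegalArgumentException("maxAttempts >= 1 and baseDelayMillis >= 0 required");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                if (!isRetryable(e) || attempt >= maxAttempts) {
                    throw e; // not a transient conflict, or retry budget exhausted
                }
                // Full jitter: sleep a random time in [0, base * 2^(attempt-1)].
                long cap = baseDelayMillis << Math.min(attempt - 1, 10);
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated workload: fails twice with a deadlock, then succeeds.
        int[] calls = {0};
        String outcome = withRetry(5, 10, () -> {
            if (calls[0]++ < 2) {
                throw new SQLException("deadlock detected", "40P01");
            }
            return "order placed";
        });
        System.out.println(outcome + " after " + calls[0] + " attempts");
    }
}
```

In a real service the lambda would wrap the full transaction that executes DECREMENT_SQL, so each retry starts from a clean transaction. Keeping maxAttempts small matters during a live on-sale, because unbounded retries against one hot row can add to the contention they are meant to absorb.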