On-call | Hard | oc-g629

Subject: Lock contention | Level: Senior–Staff | ~40 min | Common in: Concurrency interviews | Industries: Technology

Question

A real-time ad-bidding service runs fine at 8k req/s, but during a marketing event traffic ramps to 11k req/s and latency does not degrade gracefully; it falls off a cliff. Median latency goes from 12 ms to 900 ms over about 90 seconds, then stays there even as you add pods. The DB shows enormous `lock_wait` time concentrated on UPDATEs to a single `campaign_budgets` row, the one for the headline campaign that is receiving 70% of the event traffic. CPU per pod actually drops as you scale out. Adding capacity made it worse. Triage and explain the dynamics.
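One way to reason about why scaling out can hurt is a toy closed-loop model of the hot row. This is a sketch under stated assumptions, not a measurement of the system in the question. `HotRowModel`, the 100 µs base hold time, the connection pool size of 20, and the linear per-waiter slowdown are all hypothetical values chosen for illustration. The model treats the row lock as a single server. Once every connection is queued on it, throughput is capped at one update per hold time, and Little's law gives latency as in-flight requests divided by throughput.

```python
from dataclasses import dataclass
from typing import NamedTuple


class Saturated(NamedTuple):
    throughput_per_s: float   # UPDATEs/s the hot row can complete
    latency_s: float          # time a request spends queued plus served
    per_pod_per_s: float      # useful work each pod gets done


@dataclass(frozen=True)
class HotRowModel:
    # All three values are illustrative assumptions, not from the incident.
    base_hold_s: float = 0.0001        # lock hold time with no waiters
    per_waiter_penalty: float = 0.002  # fractional slowdown per queued waiter
    pool_size: int = 20                # DB connections per pod

    def hold_time(self, waiters: int) -> float:
        # Assumed: each extra waiter adds a little overhead to every handoff.
        return self.base_hold_s * (1.0 + self.per_waiter_penalty * waiters)

    def saturated(self, pods: int) -> Saturated:
        # At saturation every pooled connection is waiting on the same row.
        in_flight = pods * self.pool_size
        hold = self.hold_time(in_flight)
        throughput = 1.0 / hold            # the row lock is a single server
        latency = in_flight * hold         # Little's law: W = N / X
        return Saturated(throughput, latency, throughput / pods)


if __name__ == "__main__":
    model = HotRowModel()
    for pods in (10, 20, 40):
        print(pods, model.saturated(pods))
```

Under these assumptions, more pods means more waiters on the same lock, so hot-row throughput falls while latency rises, and each pod completes less work. That is consistent with per-pod CPU dropping as you scale out, because threads sit blocked on `lock_wait` instead of computing.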

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
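As one possible mitigation to discuss, a candidate might propose batching budget decrements in each pod so the hot row receives one UPDATE per flush instead of one per bid. The sketch below is a minimal illustration, not the service's actual code. `SpendBatcher`, `flush_fn`, and the unit of spend (micros) are hypothetical names. `flush_fn` stands in for whatever issues a single statement along the lines of `UPDATE campaign_budgets SET spent = spent + :total WHERE campaign_id = :id`, where the column names are also assumed.

```python
import threading
from collections import defaultdict
from typing import Callable, Dict


class SpendBatcher:
    """Accumulate per-campaign spend in memory and flush it in one write each."""

    def __init__(self, flush_fn: Callable[[str, int], None]) -> None:
        self._flush_fn = flush_fn          # hypothetical: performs one UPDATE
        self._pending: Dict[str, int] = defaultdict(int)
        self._lock = threading.Lock()      # in-process lock, held very briefly

    def record(self, campaign_id: str, micros: int) -> None:
        # Called on the request path: no DB round trip, no row lock.
        with self._lock:
            self._pending[campaign_id] += micros

    def flush(self) -> int:
        # Swap the buffer out under the lock, then write outside it so that
        # request threads are never blocked behind the database call.
        with self._lock:
            batch, self._pending = self._pending, defaultdict(int)
        for campaign_id, total in batch.items():
            self._flush_fn(campaign_id, total)
        return len(batch)


if __name__ == "__main__":
    writes = []
    batcher = SpendBatcher(lambda cid, total: writes.append((cid, total)))
    for _ in range(1000):
        batcher.record("headline", 5)
    batcher.flush()
    print(writes)
```

The trade-off worth naming in the interview is that budgets can overshoot by up to one flush interval of spend per pod, so the flush period has to be chosen against how strict the budget cap must be. A real version would also need a periodic flush trigger and a plan for spend that is buffered when a pod dies, neither of which is shown here.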

