On-callHardoc-g243

Subject Lock convoyLevel Senior–Staff~40 minCommon in Concurrency interviewsIndustries Technology, Software development

Question

A .NET order-pricing service that's been stable for a year suddenly develops a p99 latency cliff during the morning rush: p50 stays at 8ms but p99 jumps from 40ms to 900ms once concurrent requests cross ~200. CPU is only at 35% across all cores, GC is quiet, and downstream calls are fast. The dashboard shows a high context-switch rate (180k/s, up from 20k) and a large number of threads in the 'wait' state on a single `lock` around an in-memory price-table cache that every request reads under the lock. The only recent change was a config bump raising the thread-pool min from 50 to 400 to 'handle more load.' How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.