Question
A .NET order-pricing service that's been stable for a year suddenly develops a p99 latency cliff during the morning rush: p50 stays at 8ms but p99 jumps from 40ms to 900ms once concurrent requests cross ~200. CPU is only at 35% across all cores, GC is quiet, and downstream calls are fast. The dashboard shows a high context-switch rate (180k/s, up from 20k) and a large number of threads in the 'wait' state on a single `lock` around an in-memory price-table cache that every request reads under the lock. The only recent change was a config bump raising the thread-pool min from 50 to 400 to 'handle more load.' How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
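As one illustration of "what prevents a repeat", the read path can be taken off the `lock` entirely by publishing the price table as a snapshot that is never mutated after it becomes visible. This is a minimal sketch, not the service's actual code: the type `PriceTableCache`, its members, and the `string` SKU / `decimal` price shape are hypothetical, and it assumes the table is read far more often than it is refreshed and can be rebuilt wholesale on each refresh.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical cache type; the real service's class and member names are not known.
public sealed class PriceTableCache
{
    // Readers only ever see a fully built table that is never mutated after publication.
    private IReadOnlyDictionary<string, decimal> _snapshot = new Dictionary<string, decimal>();

    // Hot path: one volatile read and a dictionary lookup, with no lock to queue on.
    public bool TryGetPrice(string sku, out decimal price) =>
        Volatile.Read(ref _snapshot).TryGetValue(sku, out price);

    // Cold path: build the new table privately, then publish it with a single reference write.
    // Assumes refreshes are rare; concurrent refreshes are last-writer-wins.
    public void Replace(IEnumerable<KeyValuePair<string, decimal>> prices)
    {
        var table = new Dictionary<string, decimal>();
        foreach (var entry in prices)
        {
            table[entry.Key] = entry.Value;
        }

        IReadOnlyDictionary<string, decimal> next = table;
        Volatile.Write(ref _snapshot, next);
    }
}
```

With this shape, 400 runnable threads no longer serialize on one monitor for a read. A request in flight during a refresh sees either the old table or the new one, never a half-built one. This is the follow-up fix rather than the mitigation: reverting the thread-pool minimum to 50 is still the fastest way to stop the bleeding, and the revert should be confirmed against the context-switch rate and p99 before shipping a code change.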