On-call · Hard · oc-g253

Subject: Latency spikes · Level: Senior–Staff · ~40 min · Common in: Concurrency interviews · Industries: Technology, Software development

Question

A low-latency market-data fan-out service regressed after a refactor that replaced a single global counter with a small array of per-shard counters "to reduce contention." Counterintuitively, throughput dropped by about 30% and p99 latency got worse at high core counts, even though each thread now writes only to its own array slot and no lock is involved. CPU utilization is high, but instructions per cycle (IPC) dropped sharply; perf counters show a large rise in L2/L3 cache-coherence traffic and in "HITM" events (loads that hit a cache line held in modified state by another core). There is no allocation, no GC activity, and no lock contention. How do you triage and fix?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
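For this scenario, the signals in the question (adjacent per-shard slots, rising HITM events, falling IPC, no locks) are consistent with false sharing: several 8-byte counters sit on one cache line, so each write by one core invalidates that line in the others. The sketch below is one hedged illustration of a possible fix, not the service's actual code. The names `PackedCounters`, `PaddedCounters`, `record`, `total`, and `lines_touched`, the shard count of 8, and the 64-byte line size are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (common on x86-64; some CPUs use 128).
constexpr std::size_t kCacheLine = 64;
constexpr std::size_t kShards = 8;  // hypothetical shard count

// Layout resembling the refactor described in the question: adjacent
// 8-byte slots, so several shards can land on the same cache line.
struct PackedCounters {
    std::atomic<std::uint64_t> slots[kShards];
};

// Candidate fix: align and pad each slot to its own cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

struct PaddedCounters {
    PaddedCounter slots[kShards];
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one slot per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "slots start on a line");

// Number of cache lines spanned by `count` elements of size `stride`,
// assuming the array itself starts on a cache-line boundary.
constexpr std::size_t lines_touched(std::size_t stride, std::size_t count) {
    return (stride * count + kCacheLine - 1) / kCacheLine;
}

// Each thread increments only its own shard; relaxed ordering is enough
// for a statistics counter that is summed later.
inline void record(PaddedCounters& c, std::size_t shard) {
    assert(shard < kShards);
    c.slots[shard].value.fetch_add(1, std::memory_order_relaxed);
}

// Sum all shards; the result is approximate while writers are active.
inline std::uint64_t total(const PaddedCounters& c) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < kShards; ++i) {
        sum += c.slots[i].value.load(std::memory_order_relaxed);
    }
    return sum;
}
```

With 8-byte slots, `lines_touched(8, kShards)` is 1 (all eight shards contend for one line), while the padded layout spans eight lines. The trade-off is memory: padding costs 64 bytes per shard, which is usually acceptable for a small counter array. Before shipping a change like this, it would be worth confirming the hypothesis with a cache-line contention profiler and verifying that the HITM counts drop afterward.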
