Question
A C++ high-frequency packet-counter service regressed after a 'cache-friendly' refactor that packed per-thread counters into a tight contiguous array (`counters[thread_id]++`, no locks, each thread touches only its own index). On a 32-core box, throughput is now ~40% lower than before the refactor, and it degrades further as threads are added, even though there are no locks and no data races (each thread writes a different element). `perf` shows a large spike in L2/L3 cache misses and in 'HITM' (cache-line transfer) events, and the CPUs are busy but unproductive. Triage the regression and explain why a lock-free design got slower.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
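For concreteness, the pattern described above might look roughly like the sketch below. The actual service code is not shown, so the names `PackedCounters`, `count_packets`, and `run_packed`, the `std::uint64_t` element type, and the use of `std::thread` are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the refactored layout: one 8-byte counter per
// thread, stored contiguously with no padding between elements.
struct PackedCounters {
    std::vector<std::uint64_t> counters;
    explicit PackedCounters(std::size_t n_threads) : counters(n_threads, 0) {}
};

// Each worker increments only its own index, so no lock is taken and no two
// threads ever write the same element.
inline void count_packets(PackedCounters& pc, std::size_t thread_id,
                          std::uint64_t n_packets) {
    // The volatile pointer forces a real load and store on every iteration,
    // so the optimizer cannot collapse the loop into a single addition.
    volatile std::uint64_t* slot = &pc.counters[thread_id];
    for (std::uint64_t i = 0; i < n_packets; ++i) {
        *slot = *slot + 1;
    }
}

// Starts n_threads workers, waits for them, and returns the summed count.
inline std::uint64_t run_packed(std::size_t n_threads, std::uint64_t n_packets) {
    PackedCounters pc(n_threads);
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (std::size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back(count_packets, std::ref(pc), t, n_packets);
    }
    for (auto& w : workers) {
        w.join();
    }
    std::uint64_t total = 0;
    for (std::uint64_t c : pc.counters) {
        total += c;
    }
    return total;
}
```

Calling `run_packed(32, n)` with a large `n` and timing it at different thread counts is one way to reproduce the reported scaling behaviour outside the service, assuming the sketch matches the real layout closely enough.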