Question
A high-throughput Java metrics-aggregation service regressed after a 'cleanup' refactor that grouped several hot per-request `AtomicLong` counters (request count, byte count, a hot config-version field, an error count) together as adjacent fields in one small object 'for cache locality.' Counterintuitively, throughput dropped ~25% and p99 got worse, and it gets WORSE the more cores you add. CPU is high but `perf stat` shows instructions-per-cycle collapsed and a huge spike in 'LLC-stores' / cache-coherence traffic. Each core only ever increments its own conceptual counter, so there's no logical sharing and no lock. How do you triage and explain it?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.