On-callHardoc-g446

Subject Latency spikesLevel Senior–Staff~40 minCommon in Concurrency interviewsIndustries Technology, Software development

Question

A high-throughput Java metrics-aggregation service regressed after a 'cleanup' refactor that grouped several hot per-request `AtomicLong` counters (request count, byte count, a hot config-version field, an error count) together as adjacent fields in one small object 'for cache locality.' Counterintuitively, throughput dropped ~25% and p99 got worse, and it gets WORSE the more cores you add. CPU is high but `perf stat` shows instructions-per-cycle collapsed and a huge spike in 'LLC-stores' / cache-coherence traffic. Each core only ever increments its own conceptual counter, so there's no logical sharing and no lock. How do you triage and explain it?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.