Code Room
On-callHard
Question
A self-managed analytics box that colocates a query engine and several batch workers becomes molasses-slow every weeknight around 22:00: p99 on the query engine jumps from 80ms to 12s for ~20 minutes, then recovers on its own. During the window `vmstat 1` shows si/so (swap in/out) in the thousands, CPU is ~60% iowait with low user CPU, and there are no OOM kills in `dmesg`. The box has 64GB RAM and 32GB swap, `vm.overcommit_memory=1` (always overcommit), and a nightly batch job was scaled up last sprint to use more parallel workers. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.