On-callHardoc-g402

Subject Swap thrashingLevel Senior–Staff~35 minCommon in Reliability & on-call interviewsIndustries Technology

Question

A self-managed analytics box that colocates a query engine and several batch workers becomes molasses-slow every weeknight around 22:00: p99 on the query engine jumps from 80ms to 12s for ~20 minutes, then recovers on its own. During the window `vmstat 1` shows si/so (swap in/out) in the thousands, CPU is ~60% iowait with low user CPU, and there are no OOM kills in `dmesg`. The box has 64GB RAM and 32GB swap, `vm.overcommit_memory=1` (always overcommit), and a nightly batch job was scaled up last sprint to use more parallel workers. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.