On-callHardoc-g184

Subject Swap thrashingLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Software development

Question

A self-managed Elasticsearch data node becomes unresponsive: query latency goes from 50ms to 20s, the node drops in and out of the cluster, and CPU shows 70% iowait while user CPU is low. `vmstat 1` shows si/so (swap in/out) in the hundreds of MB per second sustained, load average 40 on an 8-core box, and free memory near zero. A colleague added a heavy aggregation-heavy reporting workload this morning, and someone had earlier set swappiness to 60 and left a 16GB swap file enabled. Heap is sized at 26GB on a 32GB host. Describe triage and mitigation.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.