Code Room
On-callHard
Question
A self-managed Elasticsearch data node becomes unresponsive: query latency goes from 50ms to 20s, the node drops in and out of the cluster, and CPU shows 70% iowait while user CPU is low. `vmstat 1` shows si/so (swap in/out) in the hundreds of MB per second sustained, load average 40 on an 8-core box, and free memory near zero. A colleague added a heavy aggregation-heavy reporting workload this morning, and someone had earlier set swappiness to 60 and left a 16GB swap file enabled. Heap is sized at 26GB on a 32GB host. Describe triage and mitigation.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.