On-call · Hard · oc-g411

Subject: Swap thrashing · Level: Senior–Staff · ~35 min · Common in Reliability & on-call interviews · Industries: Technology

Question

A self-managed search node on Kubernetes (memory limit 16Gi, swap enabled on the node) starts seeing query p99 climb from 60ms to 8s during the daily index-merge window, with the pod hovering just under its 16Gi limit the whole time and never OOMKilled. Node `vmstat` shows si/so spiking and high major-page-fault rate; CPU is mostly iowait. The engine memory-maps its index segments and relies on the OS page cache. A recent change raised the JVM/heap allocation inside the container, leaving much less room under the cgroup limit for the mmap'd page cache. Triage and mitigate.
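The squeeze described in the question can be made concrete with a little arithmetic. The sketch below is illustrative only: the 16 GiB limit comes from the question, while the heap sizes before and after the change and the non-heap overhead are hypothetical numbers chosen to show the shape of the problem, not values from the incident.

```python
GIB = 1024 ** 3


def page_cache_headroom(limit_bytes: int, heap_bytes: int, overhead_bytes: int) -> int:
    """Bytes left under the cgroup limit for mmap'd page cache (never negative)."""
    return max(0, limit_bytes - heap_bytes - overhead_bytes)


LIMIT = 16 * GIB        # container memory limit, from the question
OVERHEAD = 1 * GIB      # assumption: metaspace, thread stacks, direct buffers
HEAP_BEFORE = 8 * GIB   # assumption: heap size before the recent change
HEAP_AFTER = 13 * GIB   # assumption: heap size after the recent change

before = page_cache_headroom(LIMIT, HEAP_BEFORE, OVERHEAD)
after = page_cache_headroom(LIMIT, HEAP_AFTER, OVERHEAD)

print(f"page cache headroom before: {before / GIB:.1f} GiB")
print(f"page cache headroom after:  {after / GIB:.1f} GiB")
```

If the hot set of index segments touched during a merge is larger than the remaining headroom, the kernel has to keep evicting and re-reading pages, which is consistent with the major-fault and iowait pattern in the question.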

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
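One way to turn "real signals" into numbers is to diff two snapshots of the node's `/proc/vmstat` counters and report swap-in, swap-out, and major-fault rates. The sketch below is a minimal example of that idea: the counter names `pswpin`, `pswpout`, and `pgmajfault` are standard `/proc/vmstat` fields, but the sample snapshots are made-up values for illustration, and the helper names are hypothetical. Note that these counters are node-wide, so they show that the node is thrashing, not which pod is responsible.

```python
def parse_vmstat(text: str) -> dict[str, int]:
    """Parse '/proc/vmstat'-style text ('name value' per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


# Counters that back the si/so and major-fault signals in the question.
WATCHED = ("pswpin", "pswpout", "pgmajfault")


def rates_per_second(before: str, after: str, interval_s: float) -> dict[str, float]:
    """Per-second deltas of the watched counters between two snapshots."""
    a, b = parse_vmstat(before), parse_vmstat(after)
    return {name: (b[name] - a[name]) / interval_s for name in WATCHED}


# Made-up snapshots taken 10 seconds apart (illustrative values only).
SAMPLE_T0 = "pswpin 1000\npswpout 2000\npgmajfault 500\n"
SAMPLE_T1 = "pswpin 6000\npswpout 9000\npgmajfault 3500\n"

rates = rates_per_second(SAMPLE_T0, SAMPLE_T1, 10.0)
for name, value in rates.items():
    print(f"{name}: {value:.0f}/s")
```

On a live node the two snapshots would come from reading `/proc/vmstat` twice with a sleep in between. Comparing the rates inside and outside the merge window helps separate the symptom (slow queries) from the mechanism (page cache being evicted and re-faulted).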

Diagram & narrate the incident

