On-callHardoc-g454

Subject SaturationLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

A write-heavy ingestion service (writes batches to local disk / a memory-mapped store, fsync periodically) runs smoothly for minutes, then every so often the WHOLE process stalls for 1-3 seconds — all request threads freeze together, including ones doing no disk I/O at all — then it snaps back. During a stall, `iostat` shows the disk pinned at 100% util with a long write queue and high `await`; `vmstat` shows a big block of memory transitioning out of 'dirty'; CPU is in `iowait`. No GC, no lock, no deploy. The host has lots of RAM and the app buffers writes aggressively before flushing. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.