Question
A write-heavy ingestion service (writes batches to local disk / a memory-mapped store, fsync periodically) runs smoothly for minutes, then every so often the WHOLE process stalls for 1-3 seconds — all request threads freeze together, including ones doing no disk I/O at all — then it snaps back. During a stall, `iostat` shows the disk pinned at 100% util with a long write queue and high `await`; `vmstat` shows a big block of memory transitioning out of 'dirty'; CPU is in `iowait`. No GC, no lock, no deploy. The host has lots of RAM and the app buffers writes aggressively before flushing. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.