Code Room
On-callHard
Question
p99 read latency on a stateful storage service jumped from 8ms to 600ms an hour ago, but p50 is unchanged and overall throughput is flat. The spike is isolated to one of twelve storage nodes: on that node, iostat shows %util near 100% with await in the hundreds of milliseconds and a growing queue depth, and dmesg has a handful of 'I/O error' / sector-remap lines. No deploy went out; traffic is normal. Requests that happen to hit that node's shards see huge tail latency. How do you triage, restore good tail latency now, and handle the failing hardware durably?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.