Latency from a failing disk (triage, contain, root-cause)

A dying disk rarely fails cleanly — it slows down first, dragging p99 latency up as it retries bad sectors. Triage the symptom, contain the blast radius, then prove the root cause.

The idea

A disk that is starting to fail usually does not return errors right away. It retries bad sectors internally, so each read takes longer instead of failing. I/O latency climbs: await rises, %util sits high even at low IOPS, and SMART’s reallocated-sector count creeps up.

Because one slow replica serves part of the traffic, the whole service’s tail latency (p99) spikes while the mean stays flat — most requests are fine, but the unlucky ones land on the bad disk. The on-call loop is triage → contain → root-cause: confirm it’s I/O on one node, drain it so user-facing p99 recovers, then prove and replace the failing disk.
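A small numeric sketch can make the tail-versus-mean effect concrete. The numbers are made up for illustration: 1000 requests, of which a hypothetical 2% land on the slow disk at 95 ms while the rest take 20 ms. The helper names `latency_stats` and `sample` are assumptions introduced here, and the percentile uses the simple nearest-rank method.

```shell
# latency_stats: read one latency (ms) per line on stdin and print
# "mean=<ms> p99=<ms>" using the nearest-rank percentile.
latency_stats() {
  sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) { print "mean=0 p99=0"; exit }
      rank = int((NR * 99 + 99) / 100)   # ceil(0.99 * NR) in integer math
      printf "mean=%.1f p99=%d\n", sum / NR, v[rank]
    }'
}

# Hypothetical sample: 980 requests at 20 ms, 20 requests at 95 ms.
sample() {
  awk 'BEGIN { for (i = 0; i < 980; i++) print 20; for (i = 0; i < 20; i++) print 95 }'
}

sample | latency_stats   # prints: mean=21.5 p99=95
```

With every request at 20 ms, both numbers would be 20. Here the mean moves by 1.5 ms while p99 jumps to 95 ms, which is why a dashboard that only tracks the mean can miss this failure.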


During the incident, p99 climbs past the SLO line as node 2’s disk retries bad sectors, then drops back as soon as the bad node is drained. Replacing the disk fixes the root cause; the user-facing fix already landed at contain.

How it works

The response is a loop, not a single fix. Triage: confirm the symptom is I/O latency on one disk, not the app — mean latency flat but p99 up, and it’s isolated to one node. Contain: drain or cordon the bad node (or fail traffic away from it) so user-facing p99 recovers before you know the exact cause. Root-cause: prove the failing disk with SMART, dmesg, and iostat, replace it, rebuild, then rejoin the node.

# 1. TRIAGE — is one disk's I/O latency the problem?
iostat -x 2          # watch await, svctm (older sysstat only), and %util per device
#   sda  await=4ms   %util=18%   -> healthy
#   sdb  await=210ms %util=99%   -> the bad disk: high await + high %util at LOW iops

smartctl -a /dev/sdb # reallocated/pending sectors climbing = media failing
dmesg | grep -iE 'i/o error|ata.*reset|medium error'  # kernel I/O errors

# 2. CONTAIN — stop user pain first, before full diagnosis
kubectl cordon node-2          # no new pods land here
kubectl drain node-2 --ignore-daemonsets --delete-emptydir-data
#   traffic shifts to healthy replicas -> service p99 drops back to baseline

# 3. ROOT-CAUSE — prove it, then replace and restore
#   confirm sdb via SMART + dmesg, swap the disk, rebuild the array/replica, then:
kubectl uncordon node-2        # rejoin only after the disk is healthy

Containing first is the key discipline: the user-facing SLO is restored at the drain step, which buys you calm, un-paged time to confirm and replace the disk.
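For the root-cause step, one way to turn "SMART counters climbing" into a yes/no check is to pull the raw reallocated-sector count out of the attribute table and compare it with an earlier reading. This is a minimal sketch, assuming an ATA/SATA drive whose `smartctl -A` output lists `Reallocated_Sector_Ct` with the raw value in the last column; NVMe drives report health differently. The function names `realloc_raw` and `realloc_grew` are hypothetical helpers, not part of smartmontools.

```shell
# realloc_raw: read `smartctl -A` attribute-table text on stdin and print the
# raw value (last column) of Reallocated_Sector_Ct; prints nothing if absent.
realloc_raw() {
  awk '$2 == "Reallocated_Sector_Ct" { print $NF; exit }'
}

# realloc_grew PREVIOUS CURRENT: succeed (exit 0) when the current raw count
# is strictly greater than the previously recorded one.
realloc_grew() {
  [ "$2" -gt "$1" ]
}

# Example on a live host (needs root and smartmontools; the baseline of 0 is
# whatever value you recorded earlier, an assumption in this sketch):
#   now=$(smartctl -A /dev/sdb | realloc_raw)
#   realloc_grew 0 "$now" && echo "reallocated sectors grew: 0 -> $now"
```

Comparing against a stored baseline matters more than the absolute number: a small, stable count can be benign, while a count that keeps growing points at degrading media.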

Signals

| Signal | What it indicates |
| --- | --- |
| Rising await / svctm at low IOPS | The disk is slow per-op, not just busy: classic retrying media, not load. |
| %util near 100% while throughput is low | Each I/O is taking far too long; the device is saturated by slowness, not volume. |
| SMART reallocated / pending sectors climbing | The drive is remapping bad sectors; physical media is degrading. |
| dmesg shows I/O error, ATA reset, medium error | The kernel is hitting hard errors / link resets: a dying disk, not the app. |
| p99 spikes but mean latency stays flat | Only a slice of requests is slow: a tail problem, pointing at one bad replica. |
| Latency isolated to one node / device | Not a global regression (deploy, GC, dependency); it’s local hardware. |
| App CPU, heap, GC, and deploy history all flat | Rules out an app/GC root cause and steers you toward the disk. |

The combination is the tell: a tail-only p99 spike, isolated to one node, with high await at low IOPS and growing SMART counters. Any one alone is ambiguous; together they say failing disk.
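The "high await at high %util but low IOPS" part of that combination can be sketched as a simple filter over per-device numbers. The input format (`node device await_ms util_pct iops`, one row per device), the function name `flag_slow_disks`, and the default thresholds are all assumptions chosen for illustration; real thresholds depend on the hardware and workload.

```shell
# flag_slow_disks [MAX_AWAIT_MS] [MIN_UTIL_PCT] [MAX_IOPS]
# Read "node device await_ms util_pct iops" rows on stdin and print the rows
# that are slow per operation and saturated while doing little work.
# Defaults (50 ms, 90 %, 300 IOPS) are illustrative, not tuned values.
flag_slow_disks() {
  awk -v max_await="${1:-50}" -v min_util="${2:-90}" -v max_iops="${3:-300}" '
    $3 > max_await && $4 > min_util && $5 < max_iops {
      printf "%s %s SUSPECT await=%sms util=%s%% iops=%s\n", $1, $2, $3, $4, $5
    }'
}

# Illustrative rows: two healthy disks and one slow one.
flag_slow_disks <<'EOF'
node-1 sda 5 20 900
node-2 sdb 200 99 120
node-3 sdc 6 22 880
EOF
# prints: node-2 sdb SUSPECT await=200ms util=99% iops=120
```

Requiring all three conditions together keeps a disk that is merely busy (high %util at high IOPS with low await) from being flagged. A hit here is only a suspect; SMART and dmesg still have to confirm it.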


Worked example

Pager fires: service p99 has more than doubled, from 40 ms to 95 ms, but the mean has barely moved and the error rate is flat. No deploy in the window, and app CPU and heap are normal, so it is not a code regression.

# Triage: per-device iostat across the three storage nodes
node-1  sda  await=5ms    %util=20%   iops=900    # healthy
node-2  sdb  await=200ms  %util=99%   iops=120    # <-- slow per-op at LOW iops
node-3  sdc  await=6ms    %util=22%   iops=880    # healthy

smartctl -a /dev/sdb | grep -i realloc
#   Reallocated_Sector_Ct   ...  raw 0 last week -> 240 now and rising
dmesg | tail
#   ata2.00: failed command READ FPDMA QUEUED ... ata2: hard resetting link

# Contain: drain node-2 so traffic moves to node-1 and node-3
kubectl cordon node-2
kubectl drain  node-2 --ignore-daemonsets
#   ~30s later: service p99 back to ~42ms. Users recovered. Page resolved.

# Root-cause: replace the disk, rebuild the replica, rejoin
#   swap sdb, let the data store re-replicate to the new disk, then:
kubectl uncordon node-2   # only after the new disk passes SMART + a soak

The user-facing win landed at the drain, well before the disk was swapped. Triage proved it was I/O on one node; contain restored the SLO; root-cause replaced the hardware and safely brought the node back.
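The "only after the new disk passes SMART + a soak" condition can be written down as an explicit gate so the node is not rejoined by habit. This is a sketch under assumptions: `count_io_errors` and `ready_to_rejoin` are hypothetical helper names, the grep pattern is the one used during triage, and "pass" is taken to mean zero reallocated sectors and zero matching kernel log lines over the soak window.

```shell
# count_io_errors: read kernel log text on stdin and print how many lines
# match the triage patterns. `|| true` keeps the exit status 0 when grep
# finds nothing (grep -c still prints 0 in that case).
count_io_errors() {
  grep -ciE 'i/o error|ata.*reset|medium error' || true
}

# ready_to_rejoin REALLOC_RAW IO_ERROR_COUNT: succeed only when both the
# reallocated-sector raw count and the kernel I/O error count are zero.
ready_to_rejoin() {
  [ "$1" -eq 0 ] && [ "$2" -eq 0 ]
}

# Example on a live host after the soak (the realloc value comes from SMART
# on the new disk; both inputs are assumptions about how you collect them):
#   errs=$(dmesg | count_io_errors)
#   ready_to_rejoin "$realloc" "$errs" && kubectl uncordon node-2
```

Note that `dmesg` shows the whole kernel ring buffer, so errors from the old disk may still be present; in practice the soak check would look only at log lines recorded after the swap.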

Check yourself

p99 is up sharply, but mean latency is flat and one node’s disk await is 10× the others at low IOPS. What is the right first move?

iostat shows one disk at %util 99% with await 200 ms but only 120 IOPS. What does that pattern most strongly suggest?