Latency from a failing disk (triage, contain, root-cause)

A dying disk rarely fails cleanly — it slows down first, dragging p99 latency up as it retries bad sectors. Triage the symptom, contain the blast radius, then prove the root cause.

The idea

A disk that is starting to fail usually does not return errors right away. It retries bad sectors internally, so each read takes longer instead of failing. I/O latency climbs: await rises, %util sits high even at low IOPS, and SMART’s reallocated-sector count creeps up.

Because one slow replica serves part of the traffic, the whole service’s tail latency (p99) spikes while the mean stays flat — most requests are fine, but the unlucky ones land on the bad disk. The on-call loop is triage → contain → root-cause: confirm it’s I/O on one node, drain it so user-facing p99 recovers, then prove and replace the failing disk.
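
A minimal sketch of why the mean hides this kind of incident. The helper name `latency_stats` is hypothetical (it is not part of any tool mentioned here); it assumes one latency in milliseconds per line on stdin and uses the nearest-rank definition of p99.

```shell
# latency_stats: read one latency (ms) per line on stdin, print "mean=<m> p99=<p>".
# p99 is the nearest-rank percentile: the sorted value at position ceil(0.99 * n).
latency_stats() {
  sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) { print "mean=0 p99=0"; exit }
      rank = int((99 * NR + 99) / 100)   # integer form of ceil(0.99 * NR)
      printf "mean=%.1f p99=%d\n", sum / NR, v[rank]
    }'
}
```

Fed 98 requests at 40 ms and 2 requests at 400 ms, it prints `mean=47.2 p99=400`: the mean moves by a few milliseconds while p99 jumps tenfold, which is the shape described above.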

During the incident, p99 climbs past the SLO line as node 2’s disk retries bad sectors, then drops back the moment the bad node is drained. Replacing the disk is the root-cause fix; the user-facing fix already landed at contain.

How it works

The response is a loop, not a single fix. Triage: confirm the symptom is I/O latency on one disk, not the app — mean latency flat but p99 up, and it’s isolated to one node. Contain: drain or cordon the bad node (or fail traffic away from it) so user-facing p99 recovers before you know the exact cause. Root-cause: prove the failing disk with SMART, dmesg, and iostat, replace it, rebuild, then rejoin the node.

# 1. TRIAGE — is one disk's I/O latency the problem?
iostat -x 2          # watch await (r_await/w_await in recent sysstat) and %util per device; svctm is deprecated
#   sda  await=4ms   %util=18%   -> healthy
#   sdb  await=210ms %util=99%   -> the bad disk: high await + high %util at LOW iops

smartctl -a /dev/sdb # reallocated/pending sectors climbing = media failing
dmesg | grep -iE 'i/o error|ata.*reset|medium error'  # kernel I/O errors

# 2. CONTAIN — stop user pain first, before full diagnosis
kubectl cordon node-2          # no new pods land here
kubectl drain node-2 --ignore-daemonsets --delete-emptydir-data
#   traffic shifts to healthy replicas -> service p99 drops back to baseline

# 3. ROOT-CAUSE — prove it, then replace and restore
#   confirm sdb via SMART + dmesg, swap the disk, rebuild the array/replica, then:
kubectl uncordon node-2        # rejoin only after the disk is healthy

Containing first is the key discipline: the user-facing SLO is restored at the drain step, which buys you calm, un-paged time to confirm and replace the disk.
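
The contain step can be wrapped so the plan is reviewable before anything is evicted. This is a sketch, not a production runbook: `contain_node`, `_run`, and the `DRY_RUN` variable are hypothetical names, and the 120-second drain timeout is an assumed value to tune for your workloads.

```shell
# _run: print the command when DRY_RUN is 1 (the default), otherwise execute it.
_run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# contain_node: cordon first (no new pods), then drain (evict existing pods).
# The --timeout value is an assumption; pick one that fits your pod shutdown times.
contain_node() {
  node="$1"
  [ -n "$node" ] || { echo "usage: contain_node <node-name>" >&2; return 2; }
  _run kubectl cordon "$node" &&
  _run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s
}
```

Running `contain_node node-2` prints the two kubectl commands; `DRY_RUN=0 contain_node node-2` would execute them. Defaulting to dry-run is a deliberate choice here, since a drain on the wrong node during an incident makes things worse.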

Signals

Signal	What it indicates
Rising `await` (`r_await` / `w_await` in recent sysstat) at low IOPS	The disk is slow per-op, not just busy: classic retrying media, not load.
`%util` near 100% while throughput is low	Each I/O is taking far too long; the device is saturated by slowness, not volume.
SMART reallocated / pending sectors climbing	The drive is remapping bad sectors — physical media is degrading.
`dmesg` shows `I/O error`, `ATA reset`, `medium error`	The kernel is hitting hard errors / link resets — a dying disk, not the app.
p99 spikes but mean latency stays flat	Only a slice of requests is slow — a tail problem, pointing at one bad replica.
Latency isolated to one node / device	Not a global regression (deploy, GC, dependency) — it’s local hardware.
App CPU, heap, GC, and deploy history all flat	Rules out an app/GC root cause and steers you toward the disk.

The combination is the tell: a tail-only p99 spike, isolated to one node, with high await at low IOPS and growing SMART counters. Any one alone is ambiguous; together they say failing disk.
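
The per-device part of that combination can be sketched as a small filter. Everything here is an assumption for illustration: the helper name `flag_slow_disks`, the simplified four-column input (`device await_ms util_pct iops`, which you would have to extract from `iostat -x` yourself), and the default thresholds, which are not vendor guidance.

```shell
# flag_slow_disks: read "device await_ms util_pct iops" rows on stdin and print
# devices that are slow per-op (high await, high %util) while doing little work.
# Thresholds come from optional env vars; the defaults are illustrative only.
flag_slow_disks() {
  awk -v max_await="${MAX_AWAIT_MS:-50}" \
      -v min_util="${MIN_UTIL_PCT:-90}" \
      -v max_iops="${MAX_IOPS:-300}" '
    NF >= 4 && $2 + 0 > max_await + 0 && $3 + 0 >= min_util + 0 && $4 + 0 < max_iops + 0 {
      print $1
    }'
}
```

A disk at 95% utilisation with 8 ms await and 5000 IOPS is not flagged, because it is busy rather than slow. A flagged device is a candidate only; SMART counters and kernel logs still have to confirm it.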

Watch out for

Blaming the app for a hardware symptom. A p99 spike looks like a GC pause or a slow query, so teams chase the app for an hour. Check whether mean is flat and the slowness is isolated to one node’s disk before touching the code.
One slow replica poisoning the whole tail. If reads must hit the bad replica, every dependent request can stall on it. Use hedged requests, quorum reads, or fast failover so a single slow disk can’t set the service’s p99.
Restarting the app instead of draining the node. A bounce clears nothing — the disk is still slow, and you just added cold-cache latency. Cordon and drain the node; move traffic off the bad hardware.
Ignoring SMART warnings until hard failure. Reallocated and pending sectors climb for days before a drive dies. Treat rising SMART counters as a scheduled replacement, not a surprise outage.
A degraded RAID member silently slowing the array. One sick disk in a RAID set drags every read through retries while the array still reports “optimal.” Check per-device iostat, not just the array’s health bit.
Not load-shedding while retries pile up. If you keep full traffic on the failing path, retries and timeouts stack into a queue and the spike spreads. Shed or reroute load so the bad disk drains instead of cascading.
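
For the SMART pitfall above, a rising counter can be caught by comparing snapshots over time. The sketch below assumes the ATA attribute-table layout printed by `smartctl -A`, where the raw value is the tenth whitespace-separated field; NVMe and SAS drives report health differently. The names `smart_raw` and `smart_grew` are hypothetical, and how the previous value is stored is left to you.

```shell
# smart_raw: read `smartctl -A` attribute-table text on stdin and print the
# RAW_VALUE column (field 10) for the named attribute; exit 1 if it is absent.
smart_raw() {
  awk -v name="$1" '
    $2 == name { print $10; found = 1; exit }
    END { if (!found) exit 1 }'
}

# smart_grew: succeed when the current raw count is higher than the previous one.
smart_grew() {
  prev="$1"; cur="$2"
  [ "$cur" -gt "$prev" ]
}
```

One possible use is `cur=$(smartctl -A /dev/sdb | smart_raw Reallocated_Sector_Ct)` from a daily job, followed by `smart_grew "$prev" "$cur"` to open a replacement ticket when the count has grown.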

Worked example

Pager fires: service p99 has more than doubled, from 40 ms to 95 ms, but the mean has barely moved and the error rate is flat. There was no deploy in the window, and app CPU and heap are normal, so a code regression is unlikely.

# Triage: per-device iostat across the three storage nodes
node-1  sda  await=5ms    %util=20%   iops=900    # healthy
node-2  sdb  await=200ms  %util=99%   iops=120    # <-- slow per-op at LOW iops
node-3  sdc  await=6ms    %util=22%   iops=880    # healthy

smartctl -a /dev/sdb | grep -i realloc
#   Reallocated_Sector_Ct   ...  raw 0 last week -> 240 now and rising
dmesg | tail
#   ata2.00: failed command READ FPDMA QUEUED ... ata2: hard resetting link

# Contain: drain node-2 so traffic moves to node-1 and node-3
kubectl cordon node-2
kubectl drain node-2 --ignore-daemonsets --delete-emptydir-data
#   ~30s later: service p99 back to ~42ms. Users recovered. Page resolved.

# Root-cause: replace the disk, rebuild the replica, rejoin
#   swap sdb, let the data store re-replicate to the new disk, then:
kubectl uncordon node-2   # only after the new disk passes SMART + a soak

The user-facing win landed at the drain, well before the disk was swapped. Triage proved it was I/O on one node; contain restored the SLO; root-cause replaced the hardware and safely brought the node back.
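
The triage numbers in this example can be sanity-checked with one line of arithmetic: the share of time the device was busy, divided by the I/Os it completed, gives the busy time spent per I/O. The helper `busy_ms_per_io` is a hypothetical name, and the figure is only a rough guide, since devices that serve requests in parallel (SSDs, RAID sets) spend less time per I/O than it suggests.

```shell
# busy_ms_per_io: device busy time per completed I/O, in milliseconds.
# (%util / 100) seconds busy per second, times 1000 ms, divided by IOPS
# simplifies to (util * 10) / iops.
busy_ms_per_io() {
  awk -v util="$1" -v iops="$2" 'BEGIN {
    if (iops + 0 <= 0) { print "n/a"; exit }
    printf "%.2f\n", (util * 10) / iops
  }'
}
```

With the figures above, `busy_ms_per_io 99 120` prints `8.25` for node-2, while `busy_ms_per_io 20 900` prints `0.22` for node-1. The bad disk spends roughly 37 times longer on each I/O, which is why it is saturated at a fraction of the healthy disks' IOPS.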

Check yourself

p99 is up sharply, but mean latency is flat and one node’s disk await is 10× the others at low IOPS. What is the right first move?

iostat shows one disk at %util 99% with await 200 ms but only 120 IOPS. What does that pattern most strongly suggest?