Database Disk Saturation (IOPS)

When the hard drive becomes a traffic jam and takes down the API.

The idea

We often think of CPU or RAM as the main bottlenecks for a database. In practice, cloud block storage (such as AWS EBS volumes) has strict rate limits, measured in IOPS (input/output operations per second). If your database tries to read or write to disk 5,000 times a second but the volume is capped at 3,000 IOPS, the extra 2,000 requests each second pile up in a queue. The disk is "saturated." Because the backlog keeps growing for as long as the overload lasts, queries that usually take 5ms can end up taking 5,000ms, and the whole application grinds to a halt.
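To see how a 5ms query turns into a 5,000ms one, here is a minimal sketch of the queue arithmetic. It assumes a simplified model that is not in the original text: a constant arrival rate, a disk that completes exactly its IOPS limit every second, and first-come-first-served ordering. The function name `backlog_and_wait` is made up for illustration.

```python
def backlog_and_wait(arrival_rate, iops_limit, seconds):
    """Return (queued_requests, wait_seconds) after `seconds` of sustained overload.

    Simplified model: requests arrive at a constant rate, the disk completes
    exactly `iops_limit` of them per second, first come first served.
    """
    surplus = max(arrival_rate - iops_limit, 0)  # requests per second the disk cannot serve
    queued = surplus * seconds                   # the backlog grows linearly with time
    wait = queued / iops_limit                   # time needed to drain the queue ahead of a new request
    return queued, wait


for t in (1, 7.5, 30):
    queued, wait = backlog_and_wait(5000, 3000, t)
    print(f"after {t:>4}s: {queued:>7.0f} queued, new request waits ~{wait * 1000:.0f} ms")
```

Under these assumptions, 7.5 seconds of overload is enough to build a 15,000-request backlog, which takes the disk a full 5 seconds to drain. Real systems add timeouts and retries, which typically make things worse rather than better.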

Step 1: Normal operations. Queries require disk reads, which are handled instantly.

How it works (IO Wait)

When the disk is saturated, the CPU isn't actually working hard. The threads that issued the I/O are asleep, waiting for the disk to return data, and the kernel reports the time the CPU spends idle with I/O still outstanding as iowait. A server with 99% iowait can look like it has high CPU usage on a dashboard that counts iowait as busy time, but the CPU is actually idle, blocked by the disk.
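On Linux, the iowait share can be computed from two readings of the aggregate `cpu` line in `/proc/stat`. The sketch below works on sample strings so it runs anywhere. The sample values are made up, and the helper names `cpu_fields` and `iowait_percent` are hypothetical.

```python
def cpu_fields(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into its first eight tick counters.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    """
    parts = stat_line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(x) for x in parts[1:9]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(cpu_fields(before), cpu_fields(after))]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0  # index 4 is iowait


# Two made-up samples, imagined as taken one second apart
sample_1 = "cpu  1000 0 500 8000 500 0 0 0 0 0"
sample_2 = "cpu  1001 0 501 8000 598 0 0 0 0 0"
print(f"iowait: {iowait_percent(sample_1, sample_2):.1f}%")
```

On a real Linux host, the two samples would come from reading the first line of `/proc/stat` twice with a short sleep in between. Note how little user and system time accumulates in the sample: the CPU is waiting, not computing.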

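# Monitoring IOPS and I/O wait in Linux
$ iostat -dx 1

Device:         rrqm/s   wrqm/s     r/s     w/s   %util
nvme0n1           0.00     0.00 3000.00    0.00  100.00

# %util near 100 means the device was busy for the whole interval.
# On SSDs and cloud volumes that serve requests in parallel, %util alone does not prove saturation.
# The stronger signal: r/s (reads per second) is pinned at the volume's 3,000 IOPS limit.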
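The r/s and w/s columns are rates derived from cumulative counters that the kernel exposes in `/proc/diskstats`. As a rough sketch of that calculation, the code below takes two snapshot lines and compares the result with an assumed limit. The sample lines, the `IOPS_LIMIT` value (taken from the 3,000 IOPS example above), and the function names are all illustrative.

```python
def completed_ios(diskstats_line):
    """Return (device, reads_completed, writes_completed) from one /proc/diskstats line."""
    f = diskstats_line.split()
    # f[0], f[1] = major/minor numbers, f[2] = device name,
    # f[3] = reads completed, f[7] = writes completed
    return f[2], int(f[3]), int(f[7])


def iops(before, after, interval_s):
    """Reads and writes per second between two snapshots of the same device."""
    _, r0, w0 = completed_ios(before)
    dev, r1, w1 = completed_ios(after)
    return dev, (r1 - r0) / interval_s, (w1 - w0) / interval_s


IOPS_LIMIT = 3000  # assumed provisioned limit, matching the example in the text

# Made-up snapshots, imagined as taken one second apart
s1 = "259 0 nvme0n1 100000 0 800000 5000 2000 0 16000 100 0 900 5100"
s2 = "259 0 nvme0n1 103000 0 824000 6000 2000 0 16000 100 12 1900 6100"

dev, r_s, w_s = iops(s1, s2, 1.0)
print(f"{dev}: r/s={r_s:.0f} w/s={w_s:.0f} ({(r_s + w_s) / IOPS_LIMIT:.0%} of limit)")
```

Comparing measured IOPS against the known volume limit is a more direct saturation check than %util, since it does not depend on how the device handles parallel requests.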

Cost

Raising the IOPS limit in the cloud is easy but expensive: you pay AWS for a faster volume (e.g. switching from gp2 to io1, where IOPS are provisioned explicitly). The engineering fix is usually much cheaper: add a Redis cache in front of the database so repeated reads never reach the disk, or add covering indexes so the DB can answer queries from the index alone, which is far more likely to sit in RAM, instead of fetching full blocks from the disk.
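A quick way to size the caching fix is to ask what hit ratio would bring disk traffic back under the limit. The sketch below assumes, as a simplification, that every cache miss costs exactly one disk operation and every hit costs none. The function names are hypothetical.

```python
def required_hit_ratio(demand_per_s, iops_limit):
    """Smallest cache hit ratio that keeps disk operations at or below the limit.

    Assumes each miss costs exactly one disk operation and each hit costs none.
    """
    if demand_per_s <= iops_limit:
        return 0.0  # the disk already keeps up without a cache
    return 1.0 - iops_limit / demand_per_s


def disk_ops_per_s(demand_per_s, hit_ratio):
    """Disk operations per second that remain after the cache absorbs its hits."""
    return demand_per_s * (1.0 - hit_ratio)


ratio = required_hit_ratio(5000, 3000)
print(f"hit ratio needed: {ratio:.0%}")
print(f"disk load at 60% hits: {disk_ops_per_s(5000, 0.60):.0f} ops/s")
```

For the 5,000 requests per second against a 3,000 IOPS volume used earlier, a cache that absorbs 40% of requests is the break-even point under this model; anything above that leaves headroom.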

Watch out for