Durability (Bitrot & Scrubbing)

How storage systems prevent your data from slowly decaying.

The idea

Storage devices are physical objects. Over years, magnetic domains weaken, flash memory cells leak charge, and cosmic rays strike silicon. Any of these can silently flip a stored bit, turning a 1 into a 0 or a 0 into a 1. This is called bitrot. If it happens to a family photo, you might see a gray corrupted band across the image. If it happens to a database file, the database may crash or return wrong results. Advanced filesystems (like ZFS) and cloud storage (like S3) fight this by regularly running a background process called data scrubbing to detect and repair the damage.

How it works (Checksums & Redundancy)

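When you save a file, the system calculates a checksum of the data (for example, a cryptographic hash such as SHA-256) and stores it separately from the data. The scrubbing process runs on a schedule (weekly, for example), reads the file, recalculates the checksum, and compares it to the saved one. If they don't match, the data has changed since it was written, and bitrot has occurred. A checksum can only detect the damage; repairing it requires redundancy. The system fetches a copy of the file from a mirrored drive (RAID), checks that this copy matches the checksum, and uses it to overwrite the corrupted one.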
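
The detection step can be sketched in a few lines of Python. This is a minimal illustration, assuming SHA-256 from the standard `hashlib` module as the checksum; `checksum` and `flip_bit` are hypothetical helper names introduced here, and real filesystems checksum fixed-size blocks rather than whole files.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating bitrot."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

original = b"family photo bytes"
stored = checksum(original)          # saved separately at write time

rotted = flip_bit(original, 42)      # a single bit decays on disk
print(checksum(original) == stored)  # True: intact data still matches
print(checksum(rotted) == stored)    # False: the flip is detected
print(len(rotted) == len(original))  # True: file size alone would not reveal it
```

The last line shows why a checksum is needed at all: the corrupted file has the same name and size as the original, so nothing short of re-reading and re-hashing the contents exposes the damage.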

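# The concept in code
import hashlib

def background_scrub_routine(disk1, disk2, stored_hashes):
    """Scrub disk1, repairing corrupted files from its mirror, disk2."""
    for name, expected in stored_hashes.items():
        data = disk1.read(name)

        # Recalculate the checksum and compare it with the stored one
        if hashlib.sha256(data).hexdigest() == expected:
            continue
        print(f"BITROT DETECTED in {name}!")

        # Fetch the mirrored copy and verify it before trusting it
        clean_data = disk2.read(name)
        if hashlib.sha256(clean_data).hexdigest() != expected:
            print(f"Mirror copy of {name} is also corrupted; cannot repair.")
            continue

        # Overwrite the corrupted data with the verified copy
        disk1.write(name, clean_data)
        print("Successfully repaired.")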
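
To try the routine above, the disks need concrete read and write methods. The sketch below is self-contained and makes a few assumptions: `MemoryDisk` is a hypothetical in-memory stand-in rather than a real storage API, the file names and contents are made up, and `scrub` is a compact variant of the routine that checks the mirror copy against the stored checksum before using it.

```python
import hashlib

class MemoryDisk:
    """Hypothetical in-memory stand-in for a disk with read/write methods."""
    def __init__(self, files):
        self.files = dict(files)

    def read(self, name):
        return self.files[name]

    def write(self, name, data):
        self.files[name] = data

def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()

def scrub(primary, mirror, stored_hashes):
    """Return the names of files repaired on primary from mirror."""
    repaired = []
    for name, expected in stored_hashes.items():
        if sha256_hex(primary.read(name)) == expected:
            continue  # data is intact
        candidate = mirror.read(name)
        # Only trust the mirror if its copy matches the stored checksum
        if sha256_hex(candidate) == expected:
            primary.write(name, candidate)
            repaired.append(name)
    return repaired

files = {"photo.jpg": b"\xff\xd8 image data", "notes.txt": b"hello"}
stored_hashes = {name: sha256_hex(data) for name, data in files.items()}
disk1, disk2 = MemoryDisk(files), MemoryDisk(files)

# Simulate bitrot: flip one bit in disk1's copy of photo.jpg
damaged = bytearray(disk1.read("photo.jpg"))
damaged[3] ^= 0x01
disk1.write("photo.jpg", bytes(damaged))

print(scrub(disk1, disk2, stored_hashes))                  # ['photo.jpg']
print(disk1.read("photo.jpg") == disk2.read("photo.jpg"))  # True
print(scrub(disk1, disk2, stored_hashes))                  # []
```

The second scrub finds nothing to repair, which is the normal case: most scrub passes read everything and change nothing.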

Cost

Scrubbing requires reading every byte of stored data. On a 20-terabyte hard drive, a full scrub can take days of continuous, intense I/O: even at a sustained 200 MB/s (an illustrative figure), reading 20 TB takes about 28 hours, and a real scrub competes with other work on the disk. If you run a scrub during peak business hours, your application can slow to a crawl as the disk struggles to serve user requests and the scrub at the same time.
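
The arithmetic behind that cost, and one common way to soften it, can be sketched as follows. The throughput figures are illustrative assumptions rather than measurements, and `estimate_scrub_hours` and `throttled_read` are hypothetical names. The throttle caps the scrub's average read rate so that user requests keep some of the disk's bandwidth, at the price of a longer scrub.

```python
import time

def estimate_scrub_hours(capacity_bytes, read_bytes_per_sec):
    """Lower bound on scrub time: every byte is read once at the given rate."""
    return capacity_bytes / read_bytes_per_sec / 3600

def throttled_read(chunks, max_bytes_per_sec, sleep=time.sleep, clock=time.monotonic):
    """Yield chunks, pausing so the average rate stays under the cap."""
    start = clock()
    sent = 0
    for chunk in chunks:
        yield chunk
        sent += len(chunk)
        # Earliest time at which `sent` bytes are allowed to have been read
        earliest = start + sent / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)

TB = 10**12
MB = 10**6
print(round(estimate_scrub_hours(20 * TB, 200 * MB), 1))  # 27.8 (full speed)
print(round(estimate_scrub_hours(20 * TB, 50 * MB), 1))   # 111.1 (throttled to 50 MB/s)
```

Throttling to a quarter of the assumed disk speed stretches the scrub from just over a day to more than four and a half days, which is the trade-off between scrub duration and impact on foreground traffic.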

Watch out for