A tiny fingerprint stored next to your data, so a single flipped bit can't sneak through unnoticed.
Disks rot. Bits flip in transit, on a cable, in a memory module, on a platter that's been spinning for three years. If your storage layer just hands those bytes back without checking, you get silent corruption — the worst kind, because nothing tells you it happened.
A checksum is a small number computed deterministically from a block's bytes. You store it alongside the block when you write. On every read you recompute it from the bytes you got back and compare. If even one bit changed, the recomputed value won't match — and the read is rejected instead of silently returning garbage.
Storage layers like ZFS, HDFS, and btrfs store a checksum per block (ZFS keeps it in the parent pointer, separate from the data) and verify it on every read. On a mismatch they don't return the bytes — they raise an error and, if redundancy exists, repair the block from a mirror, replica, or parity ("self-healing").
from zlib import crc32

class ChecksumError(Exception):
    pass

# on write: persist the block AND its checksum
# (`disk` is a placeholder for the storage backend)
def write_block(block_id, block):
    ck = crc32(block)                       # cheap, O(bytes)
    disk.put(block_id, block, checksum=ck)  # store fingerprint next to data

# on read: recompute and compare before trusting the bytes
def read_block(block_id):
    block, stored_ck = disk.get(block_id)
    if crc32(block) != stored_ck:
        raise ChecksumError(block_id)       # do NOT return corrupt bytes
        # or: return repair_from_replica(block_id)
    return block
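The write and read paths above can be sketched as a runnable toy, including the repair-from-replica branch. This is only a minimal illustration, not how ZFS, HDFS, or btrfs are implemented: the `Disk` class is a hypothetical in-memory stand-in for a storage device, and a "mirror" is simply a second `Disk` holding the same block.

```python
import zlib


class ChecksumError(Exception):
    """Raised when no copy of a block passes CRC-32 verification."""


class Disk:
    """Hypothetical in-memory disk: block_id -> (bytes, checksum)."""

    def __init__(self):
        self.blocks = {}

    def put(self, block_id, block, checksum):
        self.blocks[block_id] = (bytes(block), checksum)

    def get(self, block_id):
        return self.blocks[block_id]


def write_block(disks, block_id, block):
    ck = zlib.crc32(block)  # one linear pass over the bytes
    for d in disks:  # mirror the block together with its checksum
        d.put(block_id, block, ck)


def read_block(disks, block_id):
    bad = []
    for d in disks:
        block, stored_ck = d.get(block_id)
        if zlib.crc32(block) == stored_ck:
            for b in bad:  # self-heal copies that failed verification
                b.put(block_id, block, stored_ck)
            return block
        bad.append(d)
    raise ChecksumError(block_id)  # every copy is corrupt: refuse to return bytes


primary, mirror = Disk(), Disk()
write_block([primary, mirror], 7, b"Hello DB")

# Simulate bit rot on the primary: flip the low bit of byte 2.
data, ck = primary.blocks[7]
primary.blocks[7] = (data[:2] + bytes([data[2] ^ 0x01]) + data[3:], ck)

recovered = read_block([primary, mirror], 7)
print(recovered)  # b'Hello DB', served from the mirror
```

After the read, the primary's copy has been overwritten with the verified bytes from the mirror. If every copy fails verification, the sketch raises `ChecksumError` rather than returning data it cannot trust.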
This demo computes a real CRC-32 live (the standard IEEE CRC-32 that zlib and gzip use; 0xEDB88320 is its polynomial written in reflected, LSB-first form), so every hex value you see is the actual computed value. CRC-32 catches every single-bit error and every burst error up to 32 bits long, and it misses longer random corruption only about once in 2^32 cases, all very cheaply. But it is not cryptographic: it stops accidents, not an adversary.
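To make the polynomial concrete, here is a bit-at-a-time CRC-32 sketch. It is not the demo's own code, and production implementations use lookup tables for speed. It assumes the usual CRC-32 parameters (initial value 0xFFFFFFFF, reflected processing, final XOR 0xFFFFFFFF) and checks itself against Python's `zlib.crc32`.

```python
import zlib

POLY = 0xEDB88320  # reflected form of the IEEE polynomial 0x04C11DB7


def crc32_bitwise(data: bytes) -> int:
    crc = 0xFFFFFFFF  # standard initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; XOR in the polynomial if a 1 bit fell off.
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF  # standard final XOR


# "123456789" -> 0xCBF43926 is the well-known CRC-32 check value.
print(crc32_bitwise(b"123456789") == 0xCBF43926)
print(crc32_bitwise(b"Hello DB") == zlib.crc32(b"Hello DB"))
```

Both lines print `True`. The inner loop is the whole algorithm: the table-driven variant just precomputes the result of those eight steps for each of the 256 possible byte values.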
| Dimension | What it costs / catches |
|---|---|
| Compute cost | Cheap, O(bytes) — a single linear pass; table-driven CRC-32 runs at GB/s |
| Storage overhead | A few bytes per block (CRC-32 = 4 bytes; a SHA-256 digest = 32 bytes) |
| What it catches | Bit rot on disk, torn / partial writes, in-transit flips, misdirected / phantom writes (with self-id) |
| What it does not catch | Deliberate tampering — a CRC can be recomputed for altered data; needs an HMAC or signed hash |
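
The last row of the table can be illustrated with a short sketch. It assumes an attacker who can overwrite both a block and its stored CRC but does not know a secret key. The key value and the `tag` / `verify` helper names are hypothetical, chosen only for this example.

```python
import hashlib
import hmac
import zlib

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only


def tag(block: bytes) -> bytes:
    """Keyed fingerprint: HMAC-SHA-256 of the block (32 bytes)."""
    return hmac.new(SECRET_KEY, block, hashlib.sha256).digest()


def verify(block: bytes, stored_tag: bytes) -> bool:
    # compare_digest performs a constant-time comparison
    return hmac.compare_digest(tag(block), stored_tag)


original = b"Hello DB"
stored_tag = tag(original)

# The attacker alters the block and simply recomputes the CRC: no secret needed.
tampered = b"Hello XX"
forged_crc = zlib.crc32(tampered)

crc_accepts = zlib.crc32(tampered) == forged_crc  # forged CRC matches the forged data
hmac_accepts = verify(tampered, stored_tag)       # the old tag does not match new data

print("CRC check passes: ", crc_accepts)
print("HMAC check passes:", hmac_accepts)
```

The CRC check passes because anyone can compute a CRC. The HMAC check fails because producing a matching tag for the altered bytes requires the key.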
For tamper resistance, use an HMAC or a signed SHA-256, not a CRC.
Take an 8-byte block holding the ASCII for "Hello DB": 48 65 6C 6C 6F 20 44 42. On write we compute and store its CRC-32:
stored_crc = crc32([0x48,0x65,0x6C,0x6C,0x6F,0x20,0x44,0x42])
           = 0x32EF3B16
Months later the platter develops bit rot: the low bit of byte 2 flips, so 0x6C ('l') becomes 0x6D ('m'). On read we recompute over the bytes we got back:
recomputed = crc32([0x48,0x65,0x6D,0x6C,0x6F,0x20,0x44,0x42])
           = 0xF9B3E8B3
0xF9B3E8B3 != 0x32EF3B16   # mismatch
The recomputed CRC differs from the stored one, so the read is rejected. The system raises ChecksumError and, because the block is mirrored, serves the good copy from a replica instead — the corruption is caught, never returned. (These are the exact values the demo above computes.)
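The walkthrough can be reproduced with Python's standard `zlib.crc32`, which implements the same IEEE CRC-32. The sketch below prints both values so they can be compared against the ones quoted above. It assumes "byte 2" is counted from zero, which is the position where 0x6C sits in the block.

```python
import zlib

block = bytes([0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0x44, 0x42])  # "Hello DB"
stored_crc = zlib.crc32(block)  # computed at write time

rotted = bytearray(block)
rotted[2] ^= 0x01  # flip the low bit of byte 2: 0x6C ('l') -> 0x6D ('m')
recomputed = zlib.crc32(bytes(rotted))  # computed at read time

print(f"stored_crc = 0x{stored_crc:08X}")
print(f"recomputed = 0x{recomputed:08X}")
print("mismatch: reject the read" if recomputed != stored_crc else "match")
```

Because CRC-32 detects every single-bit error, the last line reports a mismatch no matter which bit of the block is flipped.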
A read returns a block whose recomputed CRC-32 matches the stored CRC. Does that guarantee the bytes weren't maliciously altered?
Where should you store a block's checksum to best detect corruption?