Blob Storage Deduplication

Saving petabytes of storage by never saving the same file twice.

The idea

Imagine a chat app where a viral meme is forwarded by 100,000 users. If you store the 5 MB image every time it's uploaded, you consume 500 GB of disk space. With deduplication through content-addressable storage, you hash the file's contents, use the hash as the file name, and store the bytes only once. Each of the 100,000 users just stores a tiny pointer to the same hash.

Step 1: Alice uploads "meme.png". The system hashes it and stores it.

How it works (Content-Addressable Storage)

Instead of addressing files by their user-provided name (`/users/alice/meme.png`), we address them by the SHA-256 hash of their contents. The database maps each user's file name to the underlying hash.

import hashlib

# `s3` and `db` are assumed to be pre-configured client objects (a blob-store
# wrapper exposing exists/upload, and a database connection); they are not defined here.

def upload_file(user_id, filename, file_bytes):
    # 1. Compute the SHA-256 hash of the content
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    blob_key = f"blobs/{file_hash}"

    # 2. Upload the blob only if this content has not been stored before
    if not s3.exists(blob_key):
        s3.upload(file_bytes, blob_key)

    # 3. Save a pointer in the database
    # Even if the upload was skipped, we still record that user_id owns a reference to the blob
    db.execute(
        "INSERT INTO user_files (user_id, name, blob_hash) VALUES (?, ?, ?)",
        (user_id, filename, file_hash),
    )
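
The function above relies on `s3` and `db` objects that the snippet does not define. To see the idea run end to end, here is a minimal sketch that swaps them for in-memory stand-ins: the `blobs` dict and the `user_files` list are hypothetical placeholders for the blob store and the database table, not a real storage API.

```python
import hashlib

# Hypothetical in-memory stand-ins (assumptions for illustration only).
blobs = {}        # blob_hash -> file bytes (plays the role of the blob store)
user_files = []   # (user_id, filename, blob_hash) rows (plays the role of the table)

def upload_file(user_id, filename, file_bytes):
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    # Store the bytes only the first time this content is seen.
    if file_hash not in blobs:
        blobs[file_hash] = file_bytes
    # Always record the pointer, even when the blob write was skipped.
    user_files.append((user_id, filename, file_hash))
    return file_hash

meme = b"\x89PNG fake image bytes"
h_alice = upload_file("alice", "meme.png", meme)
h_bob = upload_file("bob", "funny.png", meme)

print(h_alice == h_bob)   # True: same content, same address
print(len(blobs))         # 1: the bytes are stored once
print(len(user_files))    # 2: each user keeps their own pointer
```

Note that the two uploads use different file names; only the content decides where the bytes live.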

Cost

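Time Complexity: O(N) to compute the hash, where N is the file size. The hash is usually computed in chunks as the upload streams in, so the whole file never has to sit in memory.
Space Complexity: M uploads of the same N-byte file occupy O(N) of blob storage instead of O(M * N), a saving of (M - 1) * N bytes, plus one small pointer row per upload.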
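
Hashing in chunks can be sketched with the standard `hashlib` update pattern. The helper name `hash_stream` and the 1 MiB default chunk size are assumptions for illustration; any file-like object opened in binary mode should work.

```python
import hashlib
import io

def hash_stream(stream, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a binary stream, read chunk_size bytes at a time."""
    hasher = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # empty bytes means end of stream
            break
        hasher.update(chunk)   # feeding chunks gives the same digest as hashing all at once
    return hasher.hexdigest()

# Usage: the streamed digest matches the one-shot digest of the same bytes.
data = b"x" * (3 * 1024 * 1024 + 17)
streamed = hash_stream(io.BytesIO(data), chunk_size=64 * 1024)
print(streamed == hashlib.sha256(data).hexdigest())
```

Memory use stays bounded by the chunk size regardless of how large the upload is, while total work remains linear in the file size.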

Watch out for

Garbage Collection: If Alice deletes "meme.png", you can't simply delete the underlying blob, because Bob might still be pointing to it. Keep a reference count per blob and delete the blob only when the count reaches zero.
Hash Collisions: Cryptographically unlikely with SHA-256, but if two different files ever produced the same hash, the second file's bytes would be silently discarded and its owner would get the first file's contents back on download.
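
The garbage-collection rule can be sketched as follows. This is one possible approach under stated assumptions, not a definitive implementation: it derives the reference count by counting rows in `user_files` rather than keeping a separate counter, it uses an in-memory SQLite database, and the `blobs` dict is a hypothetical stand-in for the blob store.

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE user_files (user_id TEXT, name TEXT, blob_hash TEXT)")
blobs = {}  # hypothetical stand-in for the blob store: blob_hash -> bytes

def upload_file(user_id, filename, file_bytes):
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    blobs.setdefault(file_hash, file_bytes)  # write the bytes only if absent
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO user_files (user_id, name, blob_hash) VALUES (?, ?, ?)",
            (user_id, filename, file_hash),
        )
    return file_hash

def delete_file(user_id, filename):
    """Remove the user's pointer; drop the blob only if no pointers remain."""
    with db:
        row = db.execute(
            "SELECT blob_hash FROM user_files WHERE user_id = ? AND name = ?",
            (user_id, filename),
        ).fetchone()
        if row is None:
            return False
        blob_hash = row[0]
        db.execute(
            "DELETE FROM user_files WHERE user_id = ? AND name = ?",
            (user_id, filename),
        )
        (remaining,) = db.execute(
            "SELECT COUNT(*) FROM user_files WHERE blob_hash = ?",
            (blob_hash,),
        ).fetchone()
        if remaining == 0:  # reference count hit zero
            blobs.pop(blob_hash, None)
    return True

h = upload_file("alice", "meme.png", b"meme bytes")
upload_file("bob", "lol.png", b"meme bytes")
delete_file("alice", "meme.png")
print(h in blobs)   # True: Bob still references the blob
delete_file("bob", "lol.png")
print(h in blobs)   # False: last reference gone, blob removed
```

In a real deployment the count check and the blob deletion can race with a concurrent upload of the same content, so the two steps would need to be coordinated (for example by deferring deletion to a background sweep); the sketch ignores that.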