Blob Storage Deduplication

Saving petabytes of storage by never saving the same file twice.

The idea

Imagine a chat app where a viral meme is forwarded by 100,000 users. If you store the 5 MB image every time it's uploaded, you consume 500 GB of disk space. With deduplication via Content-Addressable Storage, you hash the file's contents, use the hash as the storage key, and store the bytes only once. Each of the 100,000 users then keeps just a tiny pointer to the same hash.
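To make the numbers in the meme example concrete, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 GB = 1000 MB) and a hypothetical 100-byte pointer row per user; the pointer size is an illustrative guess, not a measured figure.

```python
# Back-of-the-envelope storage estimate for the meme example.
# Assumptions: decimal units and a hypothetical 100-byte pointer row per user.
UPLOADS = 100_000
FILE_SIZE_MB = 5
POINTER_SIZE_BYTES = 100  # hypothetical size of one database pointer row

# Naive approach: every upload stores a full copy of the image.
naive_gb = UPLOADS * FILE_SIZE_MB / 1000

# Deduplicated approach: one blob plus one small pointer per upload.
pointers_mb = UPLOADS * POINTER_SIZE_BYTES / 1_000_000
dedup_mb = FILE_SIZE_MB + pointers_mb

print(f"naive: {naive_gb:.0f} GB")
print(f"deduplicated: {dedup_mb:.0f} MB")
```

Under these assumptions the pointers take more space than the single stored blob, yet the total stays in the tens of megabytes rather than hundreds of gigabytes.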

Step 1: Alice uploads "meme.png". The system hashes it and stores it.

How it works (Content-Addressable Storage)

Instead of addressing files by their user-provided name (`/users/alice/meme.png`), we address them by their SHA-256 hash. The database maps the user's file to the underlying hash.

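import hashlib

# NOTE: `s3` and `db` are placeholder client objects assumed to be set up elsewhere.
# `s3.exists` and `s3.upload` are illustrative names, not a specific SDK's API.

def upload_file(user_id, filename, file_bytes):
    # 1. Compute the SHA-256 hash of the content
    file_hash = hashlib.sha256(file_bytes).hexdigest()

    # 2. Upload the blob only if this content is not already stored.
    #    The key depends only on the content, so a concurrent duplicate
    #    upload would write the same bytes to the same key.
    if not s3.exists(f"blobs/{file_hash}"):
        s3.upload(file_bytes, f"blobs/{file_hash}")

    # 3. Save a pointer in the database
    # Even if the upload was skipped, we still record that 'user_id' owns a reference to it
    db.execute(
        "INSERT INTO user_files (user_id, name, blob_hash) VALUES (?, ?, ?)",
        (user_id, filename, file_hash)
    )

    return file_hash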
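
The same flow can be tried without any external services. The sketch below is a toy, in-memory stand-in: the `InMemoryBlobStore` class and its attributes are hypothetical names introduced here for illustration, with a dict playing the role of the blob bucket and a list playing the role of the `user_files` table.

```python
import hashlib


class InMemoryBlobStore:
    """Toy content-addressable store; plain Python containers stand in for S3 and the database."""

    def __init__(self):
        self.blobs = {}       # hash -> bytes (stand-in for the blob bucket)
        self.user_files = []  # (user_id, filename, hash) rows (stand-in for the table)

    def upload_file(self, user_id, filename, file_bytes):
        file_hash = hashlib.sha256(file_bytes).hexdigest()
        # Store the bytes only if this content has not been seen before.
        if file_hash not in self.blobs:
            self.blobs[file_hash] = file_bytes
        # Always record the pointer, even when the blob write was skipped.
        self.user_files.append((user_id, filename, file_hash))
        return file_hash


store = InMemoryBlobStore()
meme = b"pretend these are PNG bytes"
h1 = store.upload_file("alice", "meme.png", meme)
h2 = store.upload_file("bob", "funny.png", meme)

print(h1 == h2)               # True: identical content, identical hash
print(len(store.blobs))       # 1: the bytes are stored once
print(len(store.user_files))  # 2: each user still has their own pointer
```

Note that the two uploads use different file names; only the content determines the storage key.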

Cost

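Time Complexity: O(N), where N is the file size, because every byte must pass through the hash function (usually done in chunks as the file streams in). Space Complexity: a file uploaded M times costs O(N + M) (one blob plus M small pointers) instead of O(M * N), a saving of roughly (M - 1) * N bytes.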
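
Hashing in chunks as the file streams in might look like the sketch below. The helper name `hash_stream` and the 1 MiB chunk size are assumptions chosen for illustration; any file-like object with a `read(size)` method would work as the `stream` argument.

```python
import hashlib
import io


def hash_stream(stream, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file-like object, reading it in chunks."""
    hasher = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            break
        hasher.update(chunk)
    return hasher.hexdigest()


# Chunked hashing yields the same digest as hashing the whole buffer at once.
data = b"x" * (3 * 1024 * 1024 + 17)
streamed_digest = hash_stream(io.BytesIO(data))
whole_digest = hashlib.sha256(data).hexdigest()
print(streamed_digest == whole_digest)
```

The work is still O(N) in the file size, but memory use stays bounded by the chunk size rather than by the size of the upload.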

Watch out for