Saving petabytes of storage by never saving the same file twice.
Imagine a chat app where a viral meme is forwarded by 100,000 users. If you store the 5MB image every time it's uploaded, you consume 500GB of disk space. With deduplication (Content-Addressable Storage), you hash the file's contents, use the hash as the file name, and only store it once. All 100,000 users just store a tiny pointer to the same hash.
Instead of addressing files by their user-provided name (`/users/alice/meme.png`), we address them by their SHA-256 hash. The database maps the user's file to the underlying hash.
import hashlib
def upload_file(user_id, filename, file_bytes):
# 1. Compute the SHA-256 hash of the content
file_hash = hashlib.sha256(file_bytes).hexdigest()
# 2. Check if the blob already exists in S3
if not s3.exists(f"blobs/{file_hash}"):
s3.upload(file_bytes, f"blobs/{file_hash}")
# 3. Save a pointer in the database
# Even if S3 upload was skipped, we still record that 'user_id' owns a reference to it
db.execute(
"INSERT INTO user_files (user_id, name, blob_hash) VALUES (?, ?, ?)",
(user_id, filename, file_hash)
)
Time Complexity: O(N) where N is the file size to compute the hash (usually done in chunks as it streams in). Space Complexity: Saves O(M * N) space where M is the number of duplicate uploads.