Protecting your state against human error and physical disasters.
High availability (for example, running 3 database replicas) protects you if a single server crashes. But if an admin accidentally runs `DROP TABLE users`, that destructive command is replicated to all 3 servers almost instantly! To survive logical errors or region-wide physical disasters, you must take point-in-time snapshots of your data and ship them to cold storage outside the affected region.
Taking a full backup of a 10 TB database every day is usually too expensive. Instead, many systems take a full backup once a week and then continuously archive the Write-Ahead Log (WAL), which acts as an incremental backup of every transaction committed since that full backup.
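To put rough numbers on that trade-off, here is a back-of-the-envelope comparison of how much data each schedule writes to backup storage per week. The 10 TB database size comes from the text; the 200 GB of WAL per day is an assumed figure chosen only for illustration, and the function names are hypothetical.

```python
DAYS_PER_WEEK = 7


def weekly_storage_gb_daily_full(db_size_gb):
    """Storage written per week when a full backup is taken every day."""
    return DAYS_PER_WEEK * db_size_gb


def weekly_storage_gb_weekly_full_plus_wal(db_size_gb, wal_gb_per_day):
    """Storage written per week with one full backup plus archived WAL."""
    return db_size_gb + DAYS_PER_WEEK * wal_gb_per_day


db_size_gb = 10_000      # the 10 TB database from the text
wal_gb_per_day = 200     # assumed write volume, not a measured figure

print(weekly_storage_gb_daily_full(db_size_gb))                            # 70000
print(weekly_storage_gb_weekly_full_plus_wal(db_size_gb, wal_gb_per_day))  # 11400
```

Under these assumptions the weekly-full-plus-WAL schedule writes about 11.4 TB per week instead of 70 TB. The gap narrows as the database becomes more write-heavy, since WAL volume grows with write traffic rather than with database size.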
# Point-In-Time Recovery (PITR) concept
# NOTE: `s3` and `db` are placeholder clients; their methods are illustrative, not a real API.
def restore_database(target_time):
    # 1. Fetch the last FULL backup taken before target_time
    #    (hardcoded to Sunday's backup here for simplicity)
    full_backup = s3.download("s3://backups/db-full-sunday.tar")
    db.load_snapshot(full_backup)

    # 2. Fetch all incremental WAL files written since Sunday
    wal_files = s3.list("s3://backups/wal/", since="Sunday")

    # 3. Replay every transaction sequentially up to the target time
    reached_target = False
    for wal in wal_files:
        for transaction in wal.transactions:
            if transaction.timestamp > target_time:
                reached_target = True  # Stop exactly when requested!
                break
            db.apply(transaction)
        if reached_target:
            break  # Do not open any later WAL files

    print("Database restored successfully!")
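The restore concept above hardcodes Sunday's backup in step 1. A minimal sketch of how that choice could be made from the target time instead is shown here. The `FullBackup` record, the `pick_base_backup` helper, and the object keys are all hypothetical names introduced for illustration; a real backup tool would keep its own catalog.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FullBackup:
    key: str             # hypothetical object key in backup storage
    taken_at: datetime   # when the snapshot was taken


def pick_base_backup(backups, target_time):
    """Return the most recent full backup taken at or before target_time."""
    candidates = [b for b in backups if b.taken_at <= target_time]
    if not candidates:
        raise ValueError("no full backup exists before the requested target time")
    return max(candidates, key=lambda b: b.taken_at)


# Three weekly full backups, each taken on a Sunday at 02:00.
backups = [
    FullBackup("db-full-2024-03-03.tar", datetime(2024, 3, 3, 2, 0)),
    FullBackup("db-full-2024-03-10.tar", datetime(2024, 3, 10, 2, 0)),
    FullBackup("db-full-2024-03-17.tar", datetime(2024, 3, 17, 2, 0)),
]

base = pick_base_backup(backups, target_time=datetime(2024, 3, 13, 14, 30))
print(base.key)  # db-full-2024-03-10.tar
```

WAL replay then starts from `base.taken_at` rather than from a fixed weekday. A target time older than the oldest retained full backup cannot be restored at all, which is why the helper raises instead of guessing.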
Storage cost is not negligible (S3 is cheap per gigabyte, but terabytes add up). But the real cost is recovery time, which is what your RTO (Recovery Time Objective) puts a limit on. Replaying days of WAL files to recover a database can mean hours of downtime.
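A rough way to reason about recovery time is to add the time to download the backup data to the time needed to replay the WAL. The sketch below does that arithmetic. The throughput figures and the WAL volume are assumptions picked for round numbers, not benchmarks, and `estimated_restore_hours` is a hypothetical helper.

```python
def estimated_restore_hours(full_backup_gb, wal_gb, download_gb_per_hour, replay_gb_per_hour):
    """Estimate restore time as download time plus WAL replay time."""
    download_hours = (full_backup_gb + wal_gb) / download_gb_per_hour
    replay_hours = wal_gb / replay_gb_per_hour
    return download_hours + replay_hours


# Assumed figures: 10 TB full backup, 200 GB of WAL per day,
# 1000 GB/hour download, 100 GB/hour WAL replay.
five_days_after_full = estimated_restore_hours(10_000, 5 * 200, 1_000, 100)
right_after_full = estimated_restore_hours(10_000, 0, 1_000, 100)

print(five_days_after_full)  # 21.0
print(right_after_full)      # 10.0
```

Under these assumptions, a failure five days after the last full backup takes roughly twice as long to recover from as one right after it. Taking full backups more often shrinks the WAL that must be replayed and so lowers recovery time, at the price of more storage.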