Silent Backup Failure

The worst disaster is discovering your backups don't actually work.

The idea

A cron job that runs pg_dump > backup.sql seems sufficient until the disk fills up, the API key expires, or the database schema changes in a way that breaks the tool. The script might fail silently or, worse, exit successfully while producing a truncated or zero-byte file. You only discover this when disaster strikes and you desperately need the data.
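One inexpensive guard against the zero-byte case is to check both the exit status of pg_dump and the size of its output before treating the run as a success. With shell redirection, the shell creates backup.sql before pg_dump even starts, so a file exists whether or not the dump worked. The following is a minimal sketch, assuming pg_dump is on the PATH; the function names, the database name argument, and the MIN_BYTES threshold are illustrative choices, not part of any existing tool.

```python
import subprocess
from pathlib import Path

# Hypothetical threshold: pick something well below your smallest healthy dump.
MIN_BYTES = 1024


def check_dump_file(path: Path, min_bytes: int = MIN_BYTES) -> None:
    """Raise if the dump file is missing or suspiciously small."""
    if not path.is_file():
        raise RuntimeError(f"dump file missing: {path}")
    size = path.stat().st_size
    if size < min_bytes:
        raise RuntimeError(f"dump file too small: {size} bytes at {path}")


def dump_database(dbname: str, out_path: Path) -> Path:
    """Run pg_dump and fail loudly instead of leaving a bad file behind."""
    # check=True raises CalledProcessError on a non-zero exit status,
    # so a failed dump cannot pass as a successful run.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file", str(out_path), dbname],
        check=True,
    )
    check_dump_file(out_path)
    return out_path
```

A size check only catches the crudest failures. It says nothing about whether the file can actually be restored.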

Step 1: Nightly backup script runs successfully. A 50GB file is uploaded.

How it works (Verification)

The only reliable defense against silent backup failure is automated restore verification. A backup is not a backup until it has been successfully restored into a sandbox environment and its contents validated.

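# The right way to do backups
# (Sketch: run_pg_dump, s3, spin_up_temp_database, alert_oncall and
# EXPECTED_MINIMUM are placeholders for your own tooling.)
def nightly_backup():
    sandbox_db = None
    try:
        # 1. Take the backup (inside the try, so a failed dump or upload also alerts)
        file_path = run_pg_dump()
        s3.upload(file_path)

        # 2. PROVE IT WORKS immediately by restoring into a throwaway database
        sandbox_db = spin_up_temp_database()
        sandbox_db.restore(file_path)

        # 3. Validate the data isn't empty/corrupted
        count = sandbox_db.query("SELECT COUNT(*) FROM users")
        if count < EXPECTED_MINIMUM:
            raise RuntimeError(f"Backup restored but missing data: {count} users")

        alert_oncall("Backup Success & Verified!")
    except Exception as e:
        alert_oncall(f"BACKUP FAILURE: {e}")
        raise  # re-raise so the scheduler also records a failed run
    finally:
        # Tear down the sandbox only if it was actually created
        if sandbox_db is not None:
            sandbox_db.destroy()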
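
The single COUNT(*) check in step 3 can be widened to several tables without much extra code. The sketch below is one hypothetical way to do it: the table names, the minimum counts, and the find_missing_data helper are all assumptions made for illustration, and the actual counts would come from queries against the sandbox database.

```python
# Hypothetical floor values; set them from what production normally holds.
EXPECTED_MINIMUMS = {"users": 1000, "orders": 5000}


def find_missing_data(actual_counts: dict[str, int],
                      minimums: dict[str, int]) -> list[str]:
    """Return one message per table that is absent or below its minimum."""
    problems = []
    for table, minimum in minimums.items():
        count = actual_counts.get(table)
        if count is None:
            problems.append(f"{table}: table missing from restore")
        elif count < minimum:
            problems.append(f"{table}: {count} rows, expected at least {minimum}")
    return problems


# Example: counts as they might come back from the sandbox restore.
restored = {"users": 1200, "orders": 12}
for problem in find_missing_data(restored, EXPECTED_MINIMUMS):
    print(problem)
```

Collecting every problem before raising, rather than stopping at the first, gives whoever receives the alert a fuller picture of what the restore is missing.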

Cost

Automated verification costs compute time (a temporary database has to be spun up and loaded with the data) and engineering effort. Compared with losing a company's entire dataset, that cost is trivial.

Watch out for

Monitoring the monitor: If your backup cron job stops running entirely, it won't emit a failure metric. Use a dead man's switch (a heartbeat monitor such as Dead Man's Snitch) that raises an alert if a success ping is NOT received within 24 hours.
Encryption keys: If you encrypt your backups (which you should), ensure the decryption keys are stored in a separate, highly available system. Losing the key means losing the backup.
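
The "monitoring the monitor" point comes down to alerting on the absence of a success signal, which means the check has to run somewhere other than the backup job itself. A minimal sketch of that logic follows; the heartbeat_overdue name and the MAX_SILENCE constant are illustrative assumptions, and a hosted heartbeat service would normally do this for you.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Matches the 24-hour window described above.
MAX_SILENCE = timedelta(hours=24)


def heartbeat_overdue(last_success: Optional[datetime],
                      now: datetime,
                      max_silence: timedelta = MAX_SILENCE) -> bool:
    """True when no success ping has arrived within the allowed window."""
    if last_success is None:
        return True  # never pinged: treat as a failure, not as "no data"
    return now - last_success > max_silence


now = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)
# Last success 25 hours ago: overdue
print(heartbeat_overdue(datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc), now))
# Last success 2 hours ago: fine
print(heartbeat_overdue(datetime(2024, 1, 2, 1, 0, tzinfo=timezone.utc), now))
```

Treating "never pinged" as overdue matters: a job that was never scheduled correctly should alert just like one that stopped.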