A backup you have never restored is a hope, not a backup — pin it down with a checksum and an actual test restore.
Taking a backup is the easy half. The hard half is proving it would actually bring your data back. A backup job can report “success” while writing a truncated file, skipping a table, or saving an artifact you cannot decrypt — and you only find out during the disaster, which is the worst possible moment.
Two layers, cheap to expensive, catch this. First, a checksum (say SHA-256) recorded at backup time and recomputed later detects silent corruption and bit-rot. Second, a periodic test restore into a throwaway environment, validated with row counts and a known canary record, proves the artifact and the restore tooling really work end to end.
The healthy run ends verified. The caught-failure run flips a bit in the stored artifact: the recomputed checksum no longer matches what was recorded, and verification stops the bad backup before a real outage ever relies on it.
Record the checksum at the moment you create the artifact, store it alongside the backup (ideally in separate, signed metadata), and recompute it later to detect any drift. Checksums are cheap, so run them often. A test restore is expensive, so schedule it — daily, weekly, or monthly per your recovery objectives — into an isolated scratch environment, then validate with queries, not vibes.
# 1. Take the backup and record a checksum at creation time
pg_dump --format=custom mydb > backup_2026-06-26.dump
sha256sum backup_2026-06-26.dump > backup_2026-06-26.dump.sha256
# 2. Upload artifact + checksum to a DIFFERENT account/region (separate blast radius)
aws s3 cp backup_2026-06-26.dump s3://offsite-backups/db/
aws s3 cp backup_2026-06-26.dump.sha256 s3://offsite-backups/db/
# ---- later: scheduled verification job ----------------------------------
# 3. Re-verify the checksum — catches bit-rot, truncation, silent corruption
sha256sum --check backup_2026-06-26.dump.sha256 || alert "checksum mismatch — corrupt backup"
# 4. Test-restore into a throwaway scratch database (never prod)
createdb restore_scratch
pg_restore --dbname=restore_scratch backup_2026-06-26.dump || alert "restore failed"
# 5. Validate the restored data: row counts + a known canary row
psql restore_scratch -tAc \
"SELECT count(*) = 1204889 AND bool_or(id = 'canary-row') FROM orders;" \
| grep -q t || alert "validation failed — restored data is wrong"
dropdb restore_scratch # the scratch env is disposable
If every step passes, record “verified” with a timestamp and trust your recovery point. If any step fails, the alert fires now — on a calm Tuesday — instead of mid-incident.
| Practice | What it catches | Cost |
|---|---|---|
| Checksum at creation | Nothing yet — but it is the reference every later check compares against | Negligible (one hash) |
| Checksum re-verify | Bit-rot, truncation, partial uploads, silent storage corruption | Cheap — run it often |
| Test restore to scratch | Broken restore tooling, schema drift, missing decryption key, unreadable format | Expensive — schedule it |
| Row-count validation | A backup that excluded a table or restored only part of the data | One query per critical table |
| Canary row check | Partial restores that pass on count but lost specific records | One known-value lookup |
| App-boots-on-restore | Data that loads but the application cannot actually use | Moderate — a smoke test |
Frame it in objectives. RPO (recovery point — how much data you can lose) is only real if the backup at that point actually restores; RTO (recovery time) is only real if you have measured a restore. A test restore is also a stopwatch on your RTO — the run that never happened is the run that overruns the SLA.
--exclude-table flag, or a new table the dump filter never learned about, restores cleanly and silently incomplete. Validate row counts against expectations per critical table.A team runs a monthly restore drill. The nightly pg_dump has reported success for eleven straight months, so everyone assumes recovery is fine. The drill restores last night’s artifact into a scratch database and validates row counts per table.
== Monthly restore drill — 2026-06-26 ==
checksum --check ........ OK (sha256 matches recorded value)
pg_restore ............. OK (restored into restore_scratch)
validate orders ........ OK (1,204,889 rows, canary present)
validate users ......... OK (88,301 rows, canary present)
validate audit_log ..... FAIL (0 rows — expected ~6.2M)
coach note: a schema change last quarter renamed audit_log's
parent table, and the dump's --exclude-table glob silently
started matching it. Every nightly backup since has shipped
WITHOUT audit_log. Caught on a Tuesday drill — not during a
compliance restore.
The checksum and the restore both passed: the artifact was intact and the tooling worked. Only the validation query caught the missing table. Without row-count validation, the gap would have surfaced months later during an audit, with no good backup to fall back to. The fix — correct the exclude glob and re-run the drill — took an afternoon instead of an incident.
Your nightly backup job has reported “success” every night for a year. Are you safe to rely on it for recovery?
A test restore finishes cleanly and the checksum matched, but one critical table comes back with zero rows. Which check caught the real problem?