Verifying backups (checksum + test restore)

A backup you have never restored is a hope, not a backup — pin it down with a checksum and an actual test restore.

The idea

Taking a backup is the easy half. The hard half is proving it would actually bring your data back. A backup job can report “success” while writing a truncated file, skipping a table, or saving an artifact you cannot decrypt — and you only find out during the disaster, which is the worst possible moment.

Two layers, cheap to expensive, catch this. First, a checksum (say SHA-256) recorded at backup time and recomputed later detects silent corruption and bit-rot. Second, a periodic test restore into a throwaway environment, validated with row counts and a known canary record, proves the artifact and the restore tooling really work end to end.

switch the run, then Play to watch it

A backup moves through the pipeline: take it, checksum it, store it — then later recompute the checksum, test-restore it, and validate. Press Play.

The healthy run ends verified. The caught-failure run flips a bit in the stored artifact: the recomputed checksum no longer matches what was recorded, and verification stops the bad backup before a real outage ever relies on it.

How it works

Record the checksum at the moment you create the artifact, store it alongside the backup (ideally in separate, signed metadata), and recompute it later to detect any drift. Checksums are cheap, so run them often. A test restore is expensive, so schedule it — daily, weekly, or monthly per your recovery objectives — into an isolated scratch environment, then validate with queries, not vibes.

# 1. Take the backup and record a checksum at creation time
pg_dump --format=custom mydb > backup_2026-06-26.dump
sha256sum backup_2026-06-26.dump > backup_2026-06-26.dump.sha256

# 2. Upload artifact + checksum to a DIFFERENT account/region (separate blast radius)
aws s3 cp backup_2026-06-26.dump        s3://offsite-backups/db/
aws s3 cp backup_2026-06-26.dump.sha256 s3://offsite-backups/db/

# ---- later: scheduled verification job ----------------------------------

# 3. Re-verify the checksum — catches bit-rot, truncation, silent corruption
sha256sum --check backup_2026-06-26.dump.sha256 || alert "checksum mismatch — corrupt backup"

# 4. Test-restore into a throwaway scratch database (never prod)
createdb restore_scratch
pg_restore --dbname=restore_scratch backup_2026-06-26.dump || alert "restore failed"

# 5. Validate the restored data: row counts + a known canary row
psql restore_scratch -tAc \
  "SELECT count(*) = 1204889 AND bool_or(id = 'canary-row') FROM orders;" \
  | grep -q t || alert "validation failed — restored data is wrong"

dropdb restore_scratch   # the scratch env is disposable

If every step passes, record “verified” with a timestamp and trust your recovery point. If any step fails, the alert fires now — on a calm Tuesday — instead of mid-incident.

Signals

Practice	What it catches	Cost
Checksum at creation	Nothing yet — but it is the reference every later check compares against	Negligible (one hash)
Checksum re-verify	Bit-rot, truncation, partial uploads, silent storage corruption	Cheap — run it often
Test restore to scratch	Broken restore tooling, schema drift, missing decryption key, unreadable format	Expensive — schedule it
Row-count validation	A backup that excluded a table or restored only part of the data	One query per critical table
Canary row check	Partial restores that pass on count but lost specific records	One known-value lookup
App-boots-on-restore	Data that loads but the application cannot actually use	Moderate — a smoke test

Frame it in objectives. RPO (recovery point — how much data you can lose) is only real if the backup at that point actually restores; RTO (recovery time) is only real if you have measured a restore. A test restore is also a stopwatch on your RTO — the run that never happened is the run that overruns the SLA.

Watch out for

Never test-restoring. A green backup job proves bytes were written, not that they restore. The first real restore is a terrible time to learn the tooling is broken. Schedule a recurring restore drill.
Checksumming but never re-checking. A hash recorded once and never recompared detects nothing. Bit-rot happens silently in storage; re-verify on a schedule and after every move or copy.
Backups that succeed but exclude a table. A changed --exclude-table flag, or a new table the dump filter never learned about, restores cleanly and silently incomplete. Validate row counts against expectations per critical table.
The missing decryption key. An encrypted backup is worthless if the key lives only in the system you just lost. Store keys separately, and prove the restore can decrypt as part of the drill.
Same blast radius as the source. Backups in the same account, region, or cluster as production die with it — to ransomware, an over-broad delete, or a region outage. Keep at least one copy offsite and immutable.
A “green” job that wrote zero bytes. A broken pipe or a full disk can exit 0 with an empty artifact. Assert a sane minimum size and a valid checksum, not just an exit code.

Worked example

A team runs a monthly restore drill. The nightly pg_dump has reported success for eleven straight months, so everyone assumes recovery is fine. The drill restores last night’s artifact into a scratch database and validates row counts per table.

== Monthly restore drill — 2026-06-26 ==
checksum --check ........ OK   (sha256 matches recorded value)
pg_restore ............. OK   (restored into restore_scratch)
validate orders ........ OK   (1,204,889 rows, canary present)
validate users ......... OK   (88,301 rows, canary present)
validate audit_log ..... FAIL (0 rows — expected ~6.2M)

coach note: a schema change last quarter renamed audit_log's
parent table, and the dump's --exclude-table glob silently
started matching it. Every nightly backup since has shipped
WITHOUT audit_log. Caught on a Tuesday drill — not during a
compliance restore.

The checksum and the restore both passed: the artifact was intact and the tooling worked. Only the validation query caught the missing table. Without row-count validation, the gap would have surfaced months later during an audit, with no good backup to fall back to. The fix — correct the exclude glob and re-run the drill — took an afternoon instead of an incident.

Check yourself

Your nightly backup job has reported “success” every night for a year. Are you safe to rely on it for recovery?

A test restore finishes cleanly and the checksum matched, but one critical table comes back with zero rows. Which check caught the real problem?