Verifying backups (checksum + test restore)

A backup you have never restored is a hope, not a backup — pin it down with a checksum and an actual test restore.

The idea

Taking a backup is the easy half. The hard half is proving it would actually bring your data back. A backup job can report “success” while writing a truncated file, skipping a table, or saving an artifact you cannot decrypt — and you only find out during the disaster, which is the worst possible moment.

Two layers, cheap to expensive, catch this. First, a checksum (say SHA-256) recorded at backup time and recomputed later detects silent corruption and bit-rot. Second, a periodic test restore into a throwaway environment, validated with row counts and a known canary record, proves the artifact and the restore tooling really work end to end.

switch the run, then Play to watch it
A backup moves through the pipeline: take it, checksum it, store it — then later recompute the checksum, test-restore it, and validate. Press Play.

The healthy run ends verified. The caught-failure run flips a bit in the stored artifact: the recomputed checksum no longer matches what was recorded, and verification stops the bad backup before a real outage ever relies on it.

How it works

Record the checksum at the moment you create the artifact, store it alongside the backup (ideally in separate, signed metadata), and recompute it later to detect any drift. Checksums are cheap, so run them often. A test restore is expensive, so schedule it — daily, weekly, or monthly per your recovery objectives — into an isolated scratch environment, then validate with queries, not vibes.

# 1. Take the backup and record a checksum at creation time
pg_dump --format=custom mydb > backup_2026-06-26.dump
sha256sum backup_2026-06-26.dump > backup_2026-06-26.dump.sha256

# 2. Upload artifact + checksum to a DIFFERENT account/region (separate blast radius)
aws s3 cp backup_2026-06-26.dump        s3://offsite-backups/db/
aws s3 cp backup_2026-06-26.dump.sha256 s3://offsite-backups/db/

# ---- later: scheduled verification job ----------------------------------

# 3. Re-verify the checksum — catches bit-rot, truncation, silent corruption
sha256sum --check backup_2026-06-26.dump.sha256 || alert "checksum mismatch — corrupt backup"

# 4. Test-restore into a throwaway scratch database (never prod)
createdb restore_scratch
pg_restore --dbname=restore_scratch backup_2026-06-26.dump || alert "restore failed"

# 5. Validate the restored data: row counts + a known canary row
psql restore_scratch -tAc \
  "SELECT count(*) = 1204889 AND bool_or(id = 'canary-row') FROM orders;" \
  | grep -q t || alert "validation failed — restored data is wrong"

dropdb restore_scratch   # the scratch env is disposable

If every step passes, record “verified” with a timestamp and trust your recovery point. If any step fails, the alert fires now — on a calm Tuesday — instead of mid-incident.

Signals

PracticeWhat it catchesCost
Checksum at creationNothing yet — but it is the reference every later check compares againstNegligible (one hash)
Checksum re-verifyBit-rot, truncation, partial uploads, silent storage corruptionCheap — run it often
Test restore to scratchBroken restore tooling, schema drift, missing decryption key, unreadable formatExpensive — schedule it
Row-count validationA backup that excluded a table or restored only part of the dataOne query per critical table
Canary row checkPartial restores that pass on count but lost specific recordsOne known-value lookup
App-boots-on-restoreData that loads but the application cannot actually useModerate — a smoke test

Frame it in objectives. RPO (recovery point — how much data you can lose) is only real if the backup at that point actually restores; RTO (recovery time) is only real if you have measured a restore. A test restore is also a stopwatch on your RTO — the run that never happened is the run that overruns the SLA.

Watch out for

Worked example

A team runs a monthly restore drill. The nightly pg_dump has reported success for eleven straight months, so everyone assumes recovery is fine. The drill restores last night’s artifact into a scratch database and validates row counts per table.

== Monthly restore drill — 2026-06-26 ==
checksum --check ........ OK   (sha256 matches recorded value)
pg_restore ............. OK   (restored into restore_scratch)
validate orders ........ OK   (1,204,889 rows, canary present)
validate users ......... OK   (88,301 rows, canary present)
validate audit_log ..... FAIL (0 rows — expected ~6.2M)

coach note: a schema change last quarter renamed audit_log's
parent table, and the dump's --exclude-table glob silently
started matching it. Every nightly backup since has shipped
WITHOUT audit_log. Caught on a Tuesday drill — not during a
compliance restore.

The checksum and the restore both passed: the artifact was intact and the tooling worked. Only the validation query caught the missing table. Without row-count validation, the gap would have surfaced months later during an audit, with no good backup to fall back to. The fix — correct the exclude glob and re-run the drill — took an afternoon instead of an incident.

Check yourself

Your nightly backup job has reported “success” every night for a year. Are you safe to rely on it for recovery?

A test restore finishes cleanly and the checksum matched, but one critical table comes back with zero rows. Which check caught the real problem?