On-call · Hard · oc-g111

Subject: Data corruption · Level: Senior–Staff · ~45 min · Common in: Reliability & on-call interviews · Industries: Technology

Question

It's 02:10. PagerDuty fires on a spike of `ERROR: invalid page in block 4711 of relation base/16384/24576` from the primary Postgres (v14) backing the payments ledger. The dashboards show: query error rate climbed from ~0 to 4% over 20 minutes, only on reads touching the `ledger_entries` table; replica lag is flat; the host's `node_disk_io_errors` counter jumped from 0 to 18 in the same window; SMART on the NVMe shows reallocated-sector-count rising. Two hours earlier the host was live-migrated to a new physical node by the cloud provider (maintenance event in the audit log). `data_checksums` is on. How do you triage this, mitigate the blast radius, and recover the corrupted data safely?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
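One way to turn the alert into a "real signal" is to decode the error line into the exact file and byte offset that failed. The sketch below is a minimal illustration, not a definitive tool: the helper `locate_bad_page` and the `BadPage` type are hypothetical names, and it assumes the Postgres defaults of 8 kB blocks and 1 GiB segment files (confirm with `SHOW block_size` and `SHOW segment_size` before trusting the offset).

```python
import re
from dataclasses import dataclass

# Assumptions: Postgres defaults. Verify with SHOW block_size / SHOW segment_size.
BLOCK_SIZE = 8192
SEGMENT_SIZE = 1024 ** 3
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE

# Matches the error text quoted in the question.
_INVALID_PAGE = re.compile(
    r"invalid page in block (\d+) of relation base/(\d+)/(\d+)"
)


@dataclass(frozen=True)
class BadPage:
    database_oid: int
    relfilenode: int
    block: int
    segment_file: str  # path relative to the data directory
    byte_offset: int   # offset of the page inside segment_file


def locate_bad_page(log_line: str) -> BadPage:
    """Hypothetical helper: map an 'invalid page' log line to file and offset."""
    match = _INVALID_PAGE.search(log_line)
    if match is None:
        raise ValueError("not an 'invalid page' error line")
    block, db_oid, filenode = (int(g) for g in match.groups())
    # Relations larger than one segment are split into files named
    # <filenode>, <filenode>.1, <filenode>.2, ...
    segment, block_in_segment = divmod(block, BLOCKS_PER_SEGMENT)
    suffix = f".{segment}" if segment else ""
    return BadPage(
        database_oid=db_oid,
        relfilenode=filenode,
        block=block,
        segment_file=f"base/{db_oid}/{filenode}{suffix}",
        byte_offset=block_in_segment * BLOCK_SIZE,
    )


if __name__ == "__main__":
    line = "ERROR: invalid page in block 4711 of relation base/16384/24576"
    page = locate_bad_page(line)
    print(page.segment_file, page.byte_offset)  # base/16384/24576 38592512
```

With the filenode in hand, running `SELECT pg_filenode_relation(0, 24576);` while connected to the database with OID 16384 should name the affected relation (the `0` stands for that database's default tablespace, which the `base/` prefix implies). Whether it resolves to the `ledger_entries` heap or to one of its indexes is worth settling early, since a damaged index can usually be rebuilt in place while a damaged heap page points toward recovery from a replica or backup.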
