Question
It's 02:10. PagerDuty fires on a spike of `ERROR: invalid page in block 4711 of relation base/16384/24576` from the primary Postgres (v14) backing the payments ledger. The dashboards show: the query error rate climbed from ~0 to 4% over 20 minutes, only on reads touching the `ledger_entries` table; replica lag is flat; the host's `node_disk_io_errors` counter jumped from 0 to 18 in the same window; SMART on the NVMe shows reallocated-sector-count rising. Two hours earlier, the cloud provider live-migrated the host to a new physical node (recorded as a maintenance event in the audit log). `data_checksums` is on. How do you triage this, mitigate the blast radius, and recover the corrupted data safely?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
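One early triage step is turning the error string into something you can locate on disk. The sketch below is only an illustration: the function name `locate_block` is hypothetical, and it assumes a default build with 8 kB blocks and 1 GB relation segments, so check `SHOW block_size;` before trusting the offsets.

```python
import re

# Assumptions (defaults for a stock Postgres build, not confirmed by the alert):
BLOCK_SIZE = 8192        # bytes per page
SEGMENT_BLOCKS = 131072  # 1 GiB segment file / 8 KiB pages

ERROR_RE = re.compile(
    r"invalid page in block (?P<block>\d+) of relation "
    r"(?P<path>base/(?P<db_oid>\d+)/(?P<filenode>\d+))"
)


def locate_block(message: str) -> dict:
    """Parse an 'invalid page' error and compute where the page lives on disk."""
    m = ERROR_RE.search(message)
    if m is None:
        raise ValueError("not an 'invalid page' error")

    block = int(m.group("block"))
    # Relations larger than one segment are split into files named
    # <filenode>, <filenode>.1, <filenode>.2, ...
    segment, block_in_segment = divmod(block, SEGMENT_BLOCKS)
    path = m.group("path")
    if segment > 0:
        path = f"{path}.{segment}"

    return {
        "db_oid": int(m.group("db_oid")),
        "filenode": int(m.group("filenode")),
        "block": block,
        "segment_file": path,                       # relative to the data directory
        "byte_offset": block_in_segment * BLOCK_SIZE,
    }


if __name__ == "__main__":
    msg = "ERROR: invalid page in block 4711 of relation base/16384/24576"
    print(locate_block(msg))
```

To confirm which relation the filenode belongs to, `SELECT pg_filenode_relation(0, 24576);` run in the database with OID 16384 should return its name (tablespace `0` means the database default, which matches a `base/` path). The computed file and offset are meant for read-only inspection and for correlating with the disk error log, not for editing the file in place.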