Question
A Postgres primary backing your payments service shows a write-latency cliff: COMMIT latency p99 jumped from 3ms to 480ms starting at 02:10, while CPU, query plans, connection count, and read latency are all normal. Dashboards: `pg_stat_wal` shows WAL write/sync time dominating commit time; OS-level `fsync` latency on the WAL device spiked; the underlying volume is a network-attached block device (cloud EBS-style); the storage provider's per-volume IOPS/throughput graph shows the volume hit its provisioned IOPS ceiling and is being throttled with rising queue depth; a nightly vacuum/analyze and a logical backup both kicked off at 02:00. Throughput of actual transactions is down because each commit waits on fsync. How do you triage this commit-latency / durability-path spike?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.