On-callHardoc-g649

Subject Fsync durability commit latencyLevel Mid–Senior~35 minCommon in Storage & CDN interviewsIndustries Technology

Question

A Postgres primary backing your payments service shows a write-latency cliff: COMMIT latency p99 jumped from 3ms to 480ms starting at 02:10, while CPU, query plans, connection count, and read latency are all normal. Dashboards: `pg_stat_wal` shows WAL write/sync time dominating commit time; OS-level `fsync` latency on the WAL device spiked; the underlying volume is a network-attached block device (cloud EBS-style); the storage provider's per-volume IOPS/throughput graph shows the volume hit its provisioned IOPS ceiling and is being throttled with rising queue depth; a nightly vacuum/analyze and a logical backup both kicked off at 02:00. Throughput of actual transactions is down because each commit waits on fsync. How do you triage this commit-latency / durability-path spike?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.