A write isn’t done when the function returns — it’s done when the data is durable. Until then, never tell the client “saved.”
Imagine handing a letter to a courier. The moment they take it from your hand, the letter is sent — but it is not yet delivered. If you walk away and assume it arrived, you might never learn that the courier dropped it in a puddle. A careful sender waits for a signed receipt before crossing the task off the list.
Durable storage works the same way. When you call write(), the bytes often sit in an in-memory buffer, not yet on the platter or flash. The call can return success while the data is still one power cut away from vanishing. A robust storage layer never trusts that return value alone. It verifies durability (force the data down with fsync and check what comes back, or count acknowledgements from a quorum of replicas), retries transient failures with backoff, fails over to a healthy replica, and only then acknowledges the client. If durability can’t be reached, it returns an error rather than silently losing the write.
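For a single local file, the gap between "written" and "durable" can be sketched in a few lines of Python. This is a minimal illustration, not a full storage layer: the helper name `write_durably` and the demo path are made up for this example, and the sketch assumes that an `OSError` from either the write or the flush should be treated as a failed write.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and return only after the OS reports it flushed.

    Raises OSError if the write or the flush fails; the caller must treat
    that as a failed write, not a saved one.
    """
    with open(path, "wb") as f:
        f.write(data)          # lands in Python's userspace buffer
        f.flush()              # pushes the buffer into the OS page cache
        os.fsync(f.fileno())   # asks the OS to force it to stable storage


# Usage: report success only after write_durably returns without raising.
demo_path = os.path.join(tempfile.mkdtemp(), "deposit.log")
write_durably(demo_path, b"deposit:200\n")
with open(demo_path, "rb") as f:
    saved = f.read()
```

On POSIX systems a newly created file may also need its parent directory fsynced before the directory entry itself is durable; the sketch leaves that step out to stay short.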
The coordinator writes to each replica, forces the data to stable storage, and counts acknowledgements. It stops as soon as a quorum is durable. A transient failure is retried with backoff; if a replica won’t take the write, the coordinator fails over to the next one. Only a met quorum produces an ack — anything less is an honest error returned to the client.

    def durable_write(key, value, replicas, quorum):
        acks = 0
        for r in replicas:
            for attempt in range(MAX_RETRIES):
                try:
                    r.write(key, value)
                    r.fsync()                    # force to stable storage
                    acks += 1
                    break                        # this replica is durable
                except (IOError, Timeout) as e:
                    sleep(backoff(attempt))      # exponential backoff + jitter
            if acks >= quorum:
                break                            # enough replicas durable
        if acks < quorum:
            raise WriteError("durability not met")  # tell the client; don't lie
        return Ack(version=...)                  # only ack once quorum is durable

Notice what is not here: there is no early return after the buffered write(). The ack waits behind fsync and the quorum count. fsync’s own failure is caught too — it raises, so that replica simply never counts toward the quorum.
| Trade-off | Why |
|---|---|
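| Durability up to quorum − 1 losses | An acked write sits on at least quorum replicas. With 3 replicas and quorum 2, it survives any single replica failing, and the write path itself tolerates n − quorum = 1 replica being down. That is the whole point of waiting. |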
| Higher write latency | fsync forces a slow flush to stable storage, and you wait for the slowest replica in the quorum, not the fastest. |
| Throughput vs. durability | Forcing every write down caps how many writes per second a disk sustains. Looser durability is faster but loses data on power loss. |
| Retry amplification | Each retry is extra load. A sick replica can multiply traffic; without a cap and jitter, retries become a self-inflicted storm. |
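The cap and jitter from the last row can be sketched as a small helper. This is one common variant (capped exponential backoff with "full jitter"), not necessarily what the `backoff` call in the sketch above does; `BASE_DELAY` and `MAX_DELAY` are hypothetical values picked for illustration.

```python
import random
from typing import Optional

BASE_DELAY = 0.05   # seconds; hypothetical starting delay
MAX_DELAY = 2.0     # seconds; hypothetical cap on any single wait


def backoff(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Return a capped exponential delay with full jitter for a 0-based attempt."""
    rng = rng if rng is not None else random.Random()
    # The ceiling doubles with each attempt but never exceeds MAX_DELAY.
    ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    # Full jitter: pick anywhere in [0, ceiling] so retries do not synchronize.
    return rng.uniform(0.0, ceiling)


# Usage: delays for the first eight attempts of one client.
delays = [backoff(attempt) for attempt in range(8)]
```

The cap bounds how long any one retry waits, and the random spread keeps many clients that failed at the same moment from retrying at the same moment.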
Acking before fsync. A buffered write() that returns is not durable. The OS page cache can still lose it on a power cut. Acknowledge after the flush, not before.
Ignoring fsync’s return code. fsync can fail, and on some systems a failed flush even drops the dirty pages (“fsyncgate”), so a later fsync can report success for data that never reached disk. Check the result and treat a failed flush as a failed write.
A banking client deposits $200 and the coordinator must store it durably across 3 replicas with a quorum of 2.
R1 writes, fsyncs, and acks — durable acks: 1. R2 hits a bad sector and raises EIO; the coordinator does not count it and does not ack the client. It backs off and retries R2 once — still EIO. Rather than burn more of the retry budget, it fails over to standby R3, which writes, fsyncs, and acks — durable acks: 2. The quorum of 2 is met, so now the coordinator returns success to the bank.
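The sequence above can be replayed with in-memory stand-ins. Everything here is hypothetical scaffolding for illustration: `FakeReplica`, `store_with_quorum`, and the `healthy` flag are not part of any real storage API, the backoff sleep is omitted to keep the run instant, and `max_attempts=2` models "one try plus one retry" before failing over.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class FakeReplica:
    """Hypothetical in-memory replica; healthy=False simulates a bad sector."""
    name: str
    healthy: bool = True
    durable: dict = field(default_factory=dict)
    attempts: int = 0

    def write_and_fsync(self, key, value):
        self.attempts += 1
        if not self.healthy:
            raise OSError(errno.EIO, "simulated bad sector")
        self.durable[key] = value


def store_with_quorum(key, value, replicas, quorum, max_attempts=2):
    """Return the names of replicas that stored the value; raise if fewer than quorum did."""
    acked = []
    for replica in replicas:
        for _ in range(max_attempts):
            try:
                replica.write_and_fsync(key, value)
            except OSError:
                continue          # a real coordinator would back off here
            acked.append(replica.name)
            break                 # this replica is durable
        if len(acked) >= quorum:
            return acked          # quorum durable: safe to ack the client
    raise RuntimeError("durability not met")


r1 = FakeReplica("R1")
r2 = FakeReplica("R2", healthy=False)
r3 = FakeReplica("R3")
acked = store_with_quorum("deposit-1", 200, [r1, r2, r3], quorum=2)
print(acked)  # ['R1', 'R3']
```

R2 is tried twice and never counted, so the success returned to the bank rests only on R1 and R3.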
If the quorum had never been reached, the coordinator would return a WriteError instead of a false success. Because the deposit carried an idempotency key, the client can safely retry the whole request — the same key keeps the $200 from landing twice.
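How an idempotency key keeps a retried deposit from landing twice can be sketched as follows. The `Ledger` class and the key `"req-42"` are invented for this example, and the sketch keeps its record of applied keys in memory; a real system would have to store that record as durably as the deposit itself.

```python
class Ledger:
    """Hypothetical ledger that applies each idempotency key at most once."""

    def __init__(self):
        self.balance = 0
        self._applied = {}   # idempotency key -> balance right after that deposit

    def deposit(self, idempotency_key: str, amount: int) -> int:
        if idempotency_key in self._applied:
            # Replay of a request already applied: return the old result, credit nothing.
            return self._applied[idempotency_key]
        self.balance += amount
        self._applied[idempotency_key] = self.balance
        return self.balance


ledger = Ledger()
first = ledger.deposit("req-42", 200)
retry = ledger.deposit("req-42", 200)   # client retried after an error or timeout
```

Both calls return the same balance, and the account is credited once.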
Two replicas have acked durably and the quorum is 2, but a third replica is still mid-retry. What should the coordinator do?