Handling a write error

A write isn’t done when the function returns — it’s done when the data is durable. Until then, never tell the client “saved.”

The idea

Imagine handing a letter to a courier. The moment they take it from your hand, the letter is sent — but it is not yet delivered. If you walk away and assume it arrived, you might never learn that the courier dropped it in a puddle. A careful sender waits for a signed receipt before crossing the task off the list.

Durable storage works the same way. When you call write() the bytes often sit in an in-memory buffer, not yet on the platter or flash. The call can return success while the data is still one power cut away from vanishing. A robust storage layer never trusts that return value alone. It verifies durability (force the data down with fsync and check what comes back, or count acknowledgements from a quorum of replicas), retries transient failures with backoff, fails over to a healthy replica, and only then acknowledges the client. If durability can’t be reached, it returns an error rather than silently losing the write.

Press play, or step through, to watch a write reach durability.

How it works

The coordinator writes to each replica, forces the data to stable storage, and counts acknowledgements. It stops as soon as a quorum is durable. A transient failure is retried with backoff; if a replica won’t take the write, the coordinator fails over to the next one. Only a met quorum produces an ack — anything less is an honest error returned to the client.

def durable_write(key, value, replicas, quorum):
    acks = 0
    for r in replicas:
        for attempt in range(MAX_RETRIES):
            try:
                r.write(key, value)
                r.fsync()              # force to stable storage
                acks += 1
                break                  # this replica is durable
            except (IOError, Timeout) as e:
                sleep(backoff(attempt))  # exponential backoff + jitter
        if acks >= quorum:
            break                      # enough replicas durable
    if acks < quorum:
        raise WriteError("durability not met")  # tell the client — don't lie
    return Ack(version=...)            # only ack once quorum is durable

Notice what is not here: there is no early return after the buffered write(). The ack waits behind fsync and the quorum count. fsync’s own failure is caught too — it raises, so that replica simply never counts toward the quorum.

Cost

What you pay	Why
Durability up to `n − quorum` losses	With 3 replicas and quorum 2, the write survives any single replica failing — that is the whole point of waiting.
Higher write latency	`fsync` forces a slow flush to stable storage, and you wait for the slowest replica in the quorum, not the fastest.
Throughput vs. durability	Forcing every write down caps how many writes per second a disk sustains. Looser durability is faster but loses data on power loss.
Retry amplification	Each retry is extra load. A sick replica can multiply traffic; without a cap and jitter, retries become a self-inflicted storm.

Watch out for

Returning success before fsync. A buffered write() that returns is not durable. The OS page cache can still lose it on a power cut. Acknowledge after the flush, not before.
Ignoring fsync’s return code. fsync can fail — and on some systems a failed flush even drops the dirty pages (“fsyncgate”). Check the result and treat a failed flush as a failed write.
Retrying a non-idempotent write without dedup. A blind retry can double-apply the change. Attach an idempotency key or write version so a replay is recognised and applied at most once.
Unbounded retries. Infinite retries against a struggling node create a retry storm that takes the cluster down with it. Cap the attempts and back off with jitter.
Acking on a single replica. One ack feels durable until that node dies and the data is gone. Require a quorum so the write survives a node loss.

Worked example

A banking client deposits $200 and the coordinator must store it durably across 3 replicas with a quorum of 2.

R1 writes, fsyncs, and acks — durable acks: 1. R2 hits a bad sector and raises EIO; the coordinator does not count it and does not ack the client. It backs off and retries R2 once — still EIO. Rather than burn more of the retry budget, it fails over to standby R3, which writes, fsyncs, and acks — durable acks: 2. The quorum of 2 is met, so now the coordinator returns success to the bank.

If the quorum had never been reached, the coordinator would return a WriteError instead of a false success. Because the deposit carried an idempotency key, the client can safely retry the whole request — the same key keeps the $200 from landing twice.

Check yourself

Two replicas have acked durably and the quorum is 2, but a third replica is still mid-retry. What should the coordinator do?