Handling a write error

A write isn’t done when the function returns — it’s done when the data is durable. Until then, never tell the client “saved.”

The idea

Imagine handing a letter to a courier. The moment they take it from your hand, the letter is sent — but it is not yet delivered. If you walk away and assume it arrived, you might never learn that the courier dropped it in a puddle. A careful sender waits for a signed receipt before crossing the task off the list.

Durable storage works the same way. When you call write(), the bytes often sit in an in-memory buffer, not yet on the platter or flash. The call can return success while the data is still one power cut away from vanishing. A robust storage layer never trusts that return value alone. It verifies durability (force the data down with fsync and check what comes back, or count acknowledgements from a quorum of replicas), retries transient failures with backoff, fails over to a healthy replica, and only then acknowledges the client. If durability can’t be reached, it returns an error rather than silently losing the write.
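On a single machine, "force the data down with fsync and check what comes back" might look like the sketch below. It assumes a POSIX-like system and uses only Python's standard os module. The function name write_durably and the file name are illustrative, not part of any storage library.

```python
import os
import tempfile


def write_durably(path, data):
    """Write bytes to path and return only after fsync has succeeded.

    Any OSError from write or fsync propagates, so the caller never
    mistakes a buffered write for a durable one.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                      # os.write may accept only part of the buffer
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # force the file's data to stable storage
    finally:
        os.close(fd)

    # Assumption: POSIX. Syncing the parent directory makes the new
    # directory entry durable as well as the file contents.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "deposit.log")
    write_durably(target, b"deposit 200\n")
    print("durable:", target)
```

The point of the shape is that there is no success path that skips fsync: the function either returns after the flush or raises.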

How it works

The coordinator writes to each replica, forces the data to stable storage, and counts acknowledgements. It stops as soon as a quorum is durable. A transient failure is retried with backoff; if a replica won’t take the write, the coordinator fails over to the next one. Only a met quorum produces an ack — anything less is an honest error returned to the client.

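import random
import time
from collections import namedtuple

MAX_RETRIES = 3                        # assumed cap; the source gives no value
Ack = namedtuple("Ack", "version")     # assumed shape of the success reply

class WriteError(Exception):
    """Raised when a quorum could not be made durable."""

def backoff(attempt, base=0.05):       # assumed helper: exponential backoff + jitter
    return random.uniform(0, base * 2 ** attempt)

def durable_write(key, value, replicas, quorum):
    acks = 0
    for r in replicas:
        for attempt in range(MAX_RETRIES):
            try:
                r.write(key, value)
                r.fsync()              # force to stable storage
                acks += 1
                break                  # this replica is durable
            except OSError:            # IOError and TimeoutError are both OSError in Python 3
                if attempt < MAX_RETRIES - 1:
                    time.sleep(backoff(attempt))  # no sleep after the final attempt
        if acks >= quorum:
            break                      # enough replicas durable
    if acks < quorum:
        raise WriteError("durability not met")  # tell the client; don't lie
    return Ack(version=...)            # only ack once quorum is durable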

Notice what is not here: there is no early return after the buffered write(). The ack waits behind fsync and the quorum count. A failed fsync is caught too: it raises before acks is incremented, so the attempt is retried, and a replica whose fsync never succeeds never counts toward the quorum.

Cost

| What you pay | Why |
|---|---|
| Durability up to quorum − 1 losses | An acknowledged write is durable on at least quorum replicas. With 3 replicas and quorum 2, it survives any single replica failing; that is the whole point of waiting. |
| Higher write latency | fsync forces a slow flush to stable storage, and you wait for the slowest replica in the quorum, not the fastest. |
| Throughput vs. durability | Forcing every write down caps how many writes per second a disk sustains. Looser durability is faster but loses data on power loss. |
| Retry amplification | Each retry is extra load. A sick replica can multiply traffic; without a cap and jitter, retries become a self-inflicted storm. |
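To make "a cap and jitter" concrete, here is one possible shape for the delay helper that the backoff(attempt) call stands for. It is a sketch of the common full-jitter scheme, not necessarily what any given system uses. The constants BASE_DELAY, MAX_DELAY and MAX_RETRIES and the name capped_backoff are assumptions chosen for illustration.

```python
import random

BASE_DELAY = 0.05   # seconds; assumed starting delay
MAX_DELAY = 2.0     # seconds; assumed cap on any single wait
MAX_RETRIES = 5     # assumed retry budget per replica


def capped_backoff(attempt, rng=random):
    """Full jitter: a uniform wait in [0, min(MAX_DELAY, BASE_DELAY * 2**attempt)]."""
    ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    return rng.uniform(0, ceiling)


def retry_schedule(rng=random):
    """Return the waits one replica would see if every attempt failed."""
    return [capped_backoff(attempt, rng) for attempt in range(MAX_RETRIES)]


if __name__ == "__main__":
    for attempt, delay in enumerate(retry_schedule(random.Random(0))):
        print(f"attempt {attempt}: wait up to cap, got {delay:.3f}s")
```

The cap bounds how long any one client waits, the retry budget bounds the extra load per write, and the randomness keeps many clients from retrying against a sick replica in lockstep.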

Worked example

A banking client deposits $200 and the coordinator must store it durably across 3 replicas with a quorum of 2.

R1 writes, fsyncs, and acks — durable acks: 1. R2 hits a bad sector and raises EIO; the coordinator does not count it and does not ack the client. It backs off and retries R2 once — still EIO. Rather than burn more of the retry budget, it fails over to standby R3, which writes, fsyncs, and acks — durable acks: 2. The quorum of 2 is met, so now the coordinator returns success to the bank.
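The sequence above can be replayed with a small simulation. Everything here is hypothetical scaffolding: FakeReplica is an in-memory stand-in rather than a real storage API, write_and_fsync collapses the write and the flush into one call, and attempts_per_replica=2 encodes "retries R2 once". Backoff sleeps are omitted so the trace runs instantly.

```python
import errno


class FakeReplica:
    """Hypothetical in-memory stand-in for a replica; not a real storage API."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.durable = {}

    def write_and_fsync(self, key, value):
        if not self.healthy:
            raise OSError(errno.EIO, "bad sector")
        self.durable[key] = value


def replicate(key, value, replicas, quorum, attempts_per_replica=2):
    """Return (acks, trace); stop as soon as the quorum is durable."""
    acks, trace = 0, []
    for replica in replicas:
        for attempt in range(attempts_per_replica):
            try:
                replica.write_and_fsync(key, value)
            except OSError:
                trace.append(f"{replica.name}: EIO on attempt {attempt + 1}")
                continue          # a real coordinator would back off here
            acks += 1
            trace.append(f"{replica.name}: durable, acks={acks}")
            break
        if acks >= quorum:
            break
    return acks, trace


if __name__ == "__main__":
    replicas = [FakeReplica("R1"), FakeReplica("R2", healthy=False), FakeReplica("R3")]
    acks, trace = replicate("deposit-1", 200, replicas, quorum=2)
    print("\n".join(trace))
    print("success" if acks >= 2 else "WriteError")
```

Run as a script, this prints one line for R1 becoming durable, two EIO lines for R2, one line for R3 becoming durable with acks=2, and finally "success".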

If the quorum had never been reached, the coordinator would return a WriteError instead of a false success. Because the deposit carried an idempotency key, the client can safely retry the whole request — the same key keeps the $200 from landing twice.
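A minimal sketch of how an idempotency key keeps a retried deposit from landing twice. The Ledger class, its in-memory dictionary, and the key "req-7f3a" are illustrative assumptions. A real system would have to make the key record and the balance change durable together.

```python
class Ledger:
    """Hypothetical in-memory ledger that deduplicates by idempotency key."""

    def __init__(self):
        self.balance = 0
        self.applied = {}          # idempotency key -> balance after that deposit

    def deposit(self, idempotency_key, amount):
        if idempotency_key in self.applied:
            # Replay of a request already applied: return the earlier result.
            return self.applied[idempotency_key]
        self.balance += amount
        self.applied[idempotency_key] = self.balance
        return self.balance


if __name__ == "__main__":
    ledger = Ledger()
    ledger.deposit("req-7f3a", 200)
    ledger.deposit("req-7f3a", 200)   # client retry after a WriteError or timeout
    print(ledger.balance)             # prints 200
```

Because the second call carries the same key, it is recognised as a replay and the balance stays at 200. A different key would be treated as a new deposit.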

Check yourself

Two replicas have acked durably and the quorum is 2, but a third replica is still mid-retry. What should the coordinator do?