Read-after-write consistency

You save a file, ask for it back, and the system says “never heard of it,” because your read landed on a copy that hasn’t caught up yet.

The idea

When you upload an object, the store writes it to one primary node and answers success right away. The new bytes are then copied to several replicas in the background, so the system stays fast and available.

But a later read can be routed to any replica by a load balancer. If it lands on one that hasn’t received the copy yet, you see stale data — or a 404 for an object you know you just wrote. After a short window the replicas converge and every read agrees. That gap is the read-after-write hazard.
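The gap described above can be reproduced with a toy in-memory model. This is only a sketch: `Node`, `Store`, and `replicate_one` are hypothetical names invented for illustration, and "replication" is just a queue that the script drains by hand so the stale window is visible.

```python
from collections import deque


class Node:
    """One storage node: a plain dict from key to bytes."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)  # None stands in for a 404


class Store:
    """Hypothetical store: one primary, N replicas, queued replication."""

    def __init__(self, replica_count=2):
        self.primary = Node()
        self.replicas = [Node() for _ in range(replica_count)]
        self.pending = deque()  # (replica, key, value) copies not yet applied

    def put(self, key, value):
        self.primary.data[key] = value  # acknowledged once the primary has it
        for replica in self.replicas:
            self.pending.append((replica, key, value))

    def replicate_one(self):
        """Apply the oldest pending copy; stands in for background replication."""
        replica, key, value = self.pending.popleft()
        replica.data[key] = value


store = Store(replica_count=2)
store.put("avatar.png", b"new bytes")
replica_a, replica_b = store.replicas

store.replicate_one()  # only replica A has caught up
during_gap = (replica_a.read("avatar.png"), replica_b.read("avatar.png"))

store.replicate_one()  # replica B converges
after_gap = (replica_a.read("avatar.png"), replica_b.read("avatar.png"))

print(during_gap)  # (b'new bytes', None)
print(after_gap)   # (b'new bytes', b'new bytes')
```

During the gap the answer depends entirely on which replica the read happens to reach; once the queue is drained, both replicas agree with the primary.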

How it works

The write path is synchronous only to the primary; replication to the other nodes is queued afterward. The read path picks some replica, so until replication lands it may return the old version, or nothing at all. The fix is to keep the one read you care about away from lagging replicas: route it to the primary, or carry the version token from the write and accept only a response that is at least that new.

def put(key, data):                  # "data", not "bytes": don't shadow the builtin
    primary.write(key, data)         # synchronous: durable on the primary
    for r in replicas:
        replication_queue.enqueue(r, key, data)   # async, eventual
    return Ack(version=primary.version(key))      # ack before replicas have applied it

def get(key, consistent=False):
    if consistent:
        return primary.read(key)     # read-your-writes: skip replicas
    node = load_balancer.pick(replicas)
    return node.read(key)            # may be stale until convergence

# read-your-own-write: force a consistent read for the request that matters
ack = put("avatar.png", data)
obj = get("avatar.png", consistent=True)   # or retry a replica read until its version >= ack.version
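The alternative mentioned in the last comment, retrying a replica read until it reaches the acknowledged version, might look like the sketch below. It assumes each node stores a version number next to the value. `VersionedNode`, `get_at_least`, and the `on_wait` hook are hypothetical names introduced here, and the hook exists only so the example can simulate replication landing mid-wait.

```python
import time


class VersionedNode:
    """Hypothetical node storing (version, value) per key."""

    def __init__(self):
        self.data = {}

    def write(self, key, version, value):
        self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))  # version 0 means "never seen"


def get_at_least(node, key, min_version, attempts=5, delay=0.01, on_wait=None):
    """Re-read until the node reports min_version or newer, else raise."""
    for _ in range(attempts):
        version, value = node.read(key)
        if version >= min_version:
            return value
        if on_wait is not None:
            on_wait()  # demo hook; a real client would only sleep
        time.sleep(delay)
    raise TimeoutError(f"{key!r} did not reach version {min_version}")


primary, replica = VersionedNode(), VersionedNode()
primary.write("avatar.png", 1, b"old")
replica.write("avatar.png", 1, b"old")

primary.write("avatar.png", 2, b"new")  # the ack for this write carries version 2
ack_version = 2


def catch_up():
    # Simulate replication landing while the reader is waiting.
    replica.write("avatar.png", 2, b"new")


result = get_at_least(replica, "avatar.png", ack_version, on_wait=catch_up)
print(result)  # b'new'
```

The token turns "may be stale" into a condition the reader can check: a response older than the acknowledged version is rejected instead of being rendered.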

Trade-offs

Dimension	Eventual (read replica)	Strong (read primary)
Read latency	Lower — nearest replica	Higher — one hot node
Staleness window	Milliseconds to seconds	None for that key
Read throughput	Scales with replica count	Bounded by the primary
Cost	Cheaper, fan-out reads	Pricier, less cacheable
App complexity	Must tolerate staleness / retry	Simpler mental model

Watch out for

A GET immediately after a PUT can return 404 — the object exists, but your read hit a replica that hasn’t received it.
Overwriting an existing key can briefly return the old version, not your new bytes, until that replica catches up.
A LIST right after a write may omit the new object, even though a direct GET on the primary would find it.
Don’t assume a global order across replicas — two clients can observe writes landing in different sequences during the window.
Caches and CDNs extend staleness well past the replication window; a cached 404 or old body can outlive convergence by minutes.
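The last point, a cached 404 outliving convergence, can be illustrated with a small sketch. `TtlCache` is a hypothetical cache invented for this example; it stores misses as well as hits and takes the current time as an explicit argument so the behaviour is deterministic. Real caches and CDNs differ in whether and for how long they cache negative responses.

```python
class TtlCache:
    """Hypothetical cache that stores hits and misses for ttl seconds."""

    def __init__(self, origin, ttl):
        self.origin = origin  # callable: key -> value, or None for a 404
        self.ttl = ttl
        self.entries = {}     # key -> (expires_at, value)

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]   # served from cache, even if it is a 404
        value = self.origin(key)
        self.entries[key] = (now + self.ttl, value)
        return value


backend = {}  # stands in for the replica behind the cache
cache = TtlCache(origin=backend.get, ttl=60.0)

first = cache.get("avatar.png", now=0.0)    # replica is behind: 404 gets cached
backend["avatar.png"] = b"new bytes"        # replication has now converged
second = cache.get("avatar.png", now=1.0)   # still the cached 404
third = cache.get("avatar.png", now=61.0)   # entry expired, origin consulted

print(first, second, third)  # None None b'new bytes'
```

The store converged almost immediately in this sketch, yet the reader kept seeing a 404 for the full TTL, which is why the staleness window has to be measured at the client rather than at the replicas.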

Worked example

You PUT s3://bucket/avatar.png and get 200 OK. Your UI immediately GETs it to render the new avatar. The load balancer routes that read to replica B, which is still 80 ms behind in applying the replication queue, so it answers 404 Not Found. Your page shows a broken image even though the upload “succeeded.”

A retry 200 ms later is routed to replica A, which has now converged, and returns 200 OK with the bytes. The robust fix: retry with backoff, or read the object back from the primary (a consistent read) for the one request that must reflect your own write.
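The retry-with-backoff option can be sketched as follows. This is a minimal illustration, not a client for any particular store: `get_with_backoff` and its parameters are hypothetical, `fetch` is any callable that returns `None` for a 404, and the `sleep` argument is injectable so the example runs without actually waiting.

```python
import time


def get_with_backoff(fetch, key, attempts=4, base_delay=0.05, sleep=time.sleep):
    """Call fetch(key) until it returns a value, doubling the wait each time.

    fetch returns None for a 404. Returns (value, waits) so the caller can
    see how long it backed off; raises LookupError if every attempt misses.
    """
    waits = []
    for attempt in range(attempts):
        value = fetch(key)
        if value is not None:
            return value, waits
        if attempt < attempts - 1:
            delay = base_delay * (2 ** attempt)
            waits.append(delay)
            sleep(delay)
    raise LookupError(f"{key!r} still missing after {attempts} attempts")


# Simulated reads: two land on a lagging replica (404), the third succeeds.
responses = iter([None, None, b"avatar bytes"])
value, waits = get_with_backoff(
    lambda key: next(responses),
    "avatar.png",
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)

print(value)  # b'avatar bytes'
print(waits)  # [0.05, 0.1]
```

Retrying only makes sense when the caller knows the object should exist, for example right after its own successful PUT. For an arbitrary key, a 404 may simply be the truth, and the bounded attempt count keeps that case from looping forever.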

Note that Amazon S3 itself has provided strong read-after-write consistency since December 2020, covering new objects, overwrites, deletes, and list operations, so this exact new-object 404 no longer happens there. The eventual-consistency model here is still the right mental model for replicated stores in general, and for understanding why that guarantee mattered.

Check yourself

You upload report.pdf, get 200, then immediately GET it and receive a 404. What happened?

One request must reflect your own write. What’s the cleanest fix?