Uploading a big file in parts

Cut a giant file into chunks, send each one on its own, then ask the server to glue them back together.

The idea

Trying to push a 5 GB video over the network as one request is fragile. One dropped connection near the end and you start the whole thing over. Multipart upload splits the object into independent parts (chunks). Each part is uploaded in its own request, and the server hands back a small receipt for it called an ETag.

Because the parts are independent, they can travel in parallel and in any order. A flaky connection only forces you to retry the one failed part rather than the whole file, and an interrupted upload can be resumed from the parts already stored. When every part has landed, you complete the upload by sending the list of part numbers and ETags in part order, and the server concatenates the parts into one final object.
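
To make the idea concrete without touching a network, here is a toy sketch. `FakeMultipartStore` is a hypothetical in-memory stand-in, not any provider's API, and its receipts are SHA-256 digests purely for illustration (real ETag formats differ). It shows parts arriving in scrambled order, no object existing until complete, and the final bytes matching the original.

```python
import hashlib
import random


class FakeMultipartStore:
    """Hypothetical in-memory stand-in for a multipart-capable object store."""

    def __init__(self):
        self.pending = {}  # part number -> bytes; these are parts, not yet an object
        self.obj = None    # the final object; only complete() sets it

    def upload_part(self, part_number, body):
        self.pending[part_number] = body
        # The receipt. A SHA-256 digest is an assumption of this toy model.
        return hashlib.sha256(body).hexdigest()

    def complete(self, receipts):
        # receipts: list of (part_number, etag) in ascending part order
        numbers = [n for n, _ in receipts]
        if numbers != sorted(numbers):
            raise ValueError("parts must be listed in ascending order")
        for n, etag in receipts:
            if hashlib.sha256(self.pending[n]).hexdigest() != etag:
                raise ValueError(f"receipt mismatch for part {n}")
        self.obj = b"".join(self.pending[n] for n in numbers)
        return self.obj


data = bytes(range(256)) * 40  # 10,240 bytes standing in for the big file
part_size = 1024
chunks = {
    i + 1: data[offset:offset + part_size]
    for i, offset in enumerate(range(0, len(data), part_size))
}

store = FakeMultipartStore()
order = list(chunks)
random.Random(0).shuffle(order)  # send the parts in a scrambled order
receipts = {n: store.upload_part(n, chunks[n]) for n in order}

exists_before_complete = store.obj is not None
result = store.complete(sorted(receipts.items()))
```

The upload order never matters here because each part carries its own number; only the list handed to `complete` has to be in order.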

How it works

There are three logical calls: open an upload, send each part, then complete. The upload ID ties the parts together, and each part comes back with an ETag you must keep. Only complete turns the scattered parts into a real object.

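# Sketch using boto3 against Amazon S3
import boto3
from botocore.exceptions import BotoCoreError, ClientError

s3 = boto3.client("s3")
PART_SIZE = 100 * 1024 * 1024  # 100 MiB per part

def retry(fn, attempts=3):
    # Call fn(); on a boto error try again, re-raising after the last attempt.
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except (BotoCoreError, ClientError):
            if attempt == attempts:
                raise

def upload_in_parts(path, bucket, key):
    # 1) Initiate: the server returns an upload ID that ties the parts together
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = mpu["UploadId"]
    parts = []
    try:
        with open(path, "rb") as f:
            # Read the file as consecutive chunks of at most PART_SIZE bytes.
            for n, chunk in enumerate(iter(lambda: f.read(PART_SIZE), b""), start=1):
                # 2) Upload each part independently. Parts can go in parallel / any order.
                #    If this one fails, retry JUST this part; the others are untouched.
                resp = retry(lambda: s3.upload_part(
                    Bucket=bucket, Key=key, UploadId=upload_id,
                    PartNumber=n, Body=chunk,
                ))
                parts.append({"PartNumber": n, "ETag": resp["ETag"]})
        # 3) Complete: send the ORDERED list of ETags; server concatenates the parts
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": sorted(parts, key=lambda p: p["PartNumber"])},
        )
    except Exception:
        # On any unrecoverable failure, free the orphaned parts so they stop costing money
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise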
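
The loop above sends parts one after another. A minimal sketch of the parallel variant follows, under stated assumptions: `upload_parts_parallel`, `with_retry`, and the `send_part(part_number, body)` callable are hypothetical names introduced here, and the demo uses a simulated flaky sender rather than a real client, so it runs without a network.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def with_retry(fn, attempts=3, base_delay=0.5):
    # Call fn(); on failure wait with exponential backoff, then try again.
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


def upload_parts_parallel(chunks, send_part, workers=4, attempts=3, base_delay=0.5):
    """chunks: list of bytes. send_part(part_number, body) must return an ETag."""

    def upload_one(item):
        n, body = item
        # Only this part is retried; the other parts are unaffected.
        etag = with_retry(lambda: send_part(n, body), attempts, base_delay)
        return {"PartNumber": n, "ETag": etag}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(upload_one, enumerate(chunks, start=1)))
    # Sort defensively so the list is ready to hand to the complete call.
    return sorted(parts, key=lambda p: p["PartNumber"])


# Demo with a simulated sender whose first attempt at part 3 fails.
calls = {}
lock = threading.Lock()


def flaky_send(n, body):
    with lock:
        calls[n] = calls.get(n, 0) + 1
        first_try = calls[n] == 1
    if n == 3 and first_try:
        raise ConnectionError("simulated drop")
    return f"etag-{n}"


parts = upload_parts_parallel([b"a", b"b", b"c", b"d"], flaky_send, base_delay=0.0)
```

With boto3, `send_part` could be a small wrapper that calls `s3.upload_part(...)` and returns `resp["ETag"]`. The worker count is a tuning knob: more workers help only until bandwidth runs out.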

Cost & signals

What	The trade-off
Per-part overhead	Each part is its own request, and providers set a minimum part size (often ~5 MB), so tiny files gain nothing from splitting.
Parallelism	Independent parts upload concurrently, so wall-clock time drops roughly in proportion to the number of parallel connections, until your bandwidth or the provider's request limits become the bottleneck.
Retry cost	A failure re-sends one part (e.g. 100 MB) instead of the whole 5 GB file — the main reason multipart exists.
Orphaned parts	Parts uploaded but never completed or aborted sit in storage and bill you until removed.
Completion latency	`complete` makes the server concatenate parts into one object — usually fast, but not instant for very large objects.
Signal	Incomplete multipart uploads piling up in a bucket are a classic hidden bill. Add a lifecycle rule to abort uploads older than N days.
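
The lifecycle rule mentioned in the last row can be sketched for Amazon S3 roughly as follows. The helper names `abort_incomplete_rule` and `apply_rule`, the rule ID, and the 7-day default are assumptions chosen for illustration; the `AbortIncompleteMultipartUpload` rule element and `put_bucket_lifecycle_configuration` are part of the S3 API.

```python
def abort_incomplete_rule(days, prefix=""):
    # Build one S3 lifecycle rule that aborts multipart uploads still
    # incomplete `days` days after they were initiated.
    return {
        "ID": f"abort-incomplete-mpu-after-{days}d",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }


def apply_rule(bucket, days=7):
    # boto3 is imported here so the rule builder above works without it.
    import boto3

    # Caution: this call REPLACES the bucket's existing lifecycle
    # configuration, so merge with any rules already in place first.
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [abort_incomplete_rule(days)]},
    )


rule = abort_incomplete_rule(7)
```

Pick N comfortably longer than your slowest legitimate upload, since the rule cannot tell a stalled upload from a slow one.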

Watch out for

Forgetting to abort failed uploads. Orphaned parts keep costing storage silently. Call abort_multipart_upload on give-up, and add a lifecycle rule as a backstop.
Parts below the provider minimum. Every part except the last must meet the minimum size (often ~5 MB). Undersized middle parts get rejected at complete.
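Getting part order wrong. On S3 the complete list must be in ascending PartNumber order, or the request fails with an InvalidPartOrder error. The quieter bug is mislabeling chunks: the server concatenates strictly by PartNumber, so a chunk uploaded under the wrong number produces a corrupt object, not an error you'll notice immediately.
Not persisting the upload ID and ETags. Lose them and you can't easily resume or complete. On S3 you can sometimes recover them with list_multipart_uploads and list_parts, but otherwise you are left aborting and starting over.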
Assuming a 200 on a part means the object exists. It does not. The object only appears after complete succeeds.
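
One way to guard against losing the upload ID and ETags is to write them to disk after every part. The sketch below is a minimal illustration under assumptions: the JSON layout, the file name, and the helpers `save_state`, `load_state`, and `missing_parts` are all hypothetical, and the upload ID is a placeholder string.

```python
import json
import os
import tempfile


def save_state(path, upload_id, parts):
    # Write to a temp file, then rename, so a crash never leaves a
    # half-written state file behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"upload_id": upload_id, "parts": parts}, f)
    os.replace(tmp, path)


def load_state(path):
    with open(path) as f:
        return json.load(f)


def missing_parts(total_parts, parts):
    # Part numbers start at 1; return those with no stored ETag yet.
    done = {p["PartNumber"] for p in parts}
    return [n for n in range(1, total_parts + 1) if n not in done]


state_file = os.path.join(tempfile.mkdtemp(), "upload-state.json")
save_state(
    state_file,
    "example-upload-id",  # placeholder, not a real upload ID
    [{"PartNumber": 1, "ETag": "e1"}, {"PartNumber": 3, "ETag": "e3"}],
)

state = load_state(state_file)
todo = missing_parts(4, state["parts"])  # parts still to upload after a restart
```

On restart, the client reloads the state, uploads only the numbers in `todo` under the saved upload ID, and then calls complete with the full list.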

Worked example

You're uploading a 5 GB video over a flaky hotel connection, chunked into 100 MB parts — that's 50 parts. You initiate once and get an upload ID. Parts stream up in parallel, each returning an ETag you store alongside its part number.

Part 37 fails mid-flight. Because parts are independent, you retry only part 37 — the other 49 are already safely uploaded, so you re-send 100 MB, not 5 GB. Once all 50 ETags are collected, you call complete with the parts sorted by number, and the server stitches them into the final video.

Had the upload been abandoned instead, those 49 stored parts would keep billing you until an abort (or a lifecycle rule) cleaned them up.
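
The arithmetic in this example can be checked with a few lines. The helper `part_count` is a hypothetical name. Note that the round figure of 50 parts assumes decimal units; with binary units, as in a part size written `100 * 1024 * 1024`, a 5 GiB file needs 52 parts.

```python
def part_count(file_bytes, part_bytes):
    # Ceiling division: the last part may be smaller than the rest.
    return (file_bytes + part_bytes - 1) // part_bytes


MB, GB = 10**6, 10**9      # decimal units, as in the prose
MiB, GiB = 2**20, 2**30    # binary units, as in 100 * 1024 * 1024

decimal_parts = part_count(5 * GB, 100 * MB)
binary_parts = part_count(5 * GiB, 100 * MiB)

# Cost of retrying one failed part, as a fraction of the whole file.
retry_fraction = (100 * MB) / (5 * GB)

print(decimal_parts, binary_parts, f"{retry_fraction:.0%}")
```

So retrying part 37 re-sends about 2% of the file.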

Check yourself

Part 37 of a 50-part upload fails. What's the cheapest correct fix?

Every part returned a 200 and an ETag, but you never called complete. Where's the object?