Pour the river through a cup, a sip at a time — never try to hold the whole river at once.
You're serving a 2 GB file to a client. The tempting code reads the whole file into memory, then writes it to the socket. That works on your laptop with one user. Under load, every concurrent download holds its full file in RAM at once, and the server runs out of memory.
Streaming reads the file in small, fixed-size chunks and forwards each chunk to the client before reading the next. Memory stays flat and small — one chunk at a time — no matter how big the file or how many people download it. The trade is a little more code for a server that doesn't fall over.
You open the file for binary reading and loop: read one chunk of a fixed maximum size, write that chunk to the response, and repeat until the read returns nothing. The current chunk is the only file data you hold, typically a few tens of kilobytes (64 KB in the code here), so peak memory is independent of file size.
```python
CHUNK = 64 * 1024  # 64 KB per chunk


def stream_file(path, response):
    with open(path, "rb") as f:          # context manager closes the file on any exit
        while True:
            chunk = f.read(CHUNK)        # read at most CHUNK bytes
            if not chunk:                # EOF: read returned empty bytes
                break
            response.write(chunk)        # forward, then let the chunk be freed
    response.end()

# Peak memory ~= one CHUNK, regardless of file size.
```
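One detail about the loop above: in Python, `f.read(CHUNK)` returns a new bytes object on every pass, so each chunk is a fresh allocation that is freed once forwarded, not one region reused in place. Where a single reused buffer is wanted, a minimal sketch is to preallocate a `bytearray` and fill it with `readinto()`. The names `stream_file_reusing_buffer` and `write` are hypothetical, and `write` stands in for whatever the response object offers.

```python
import io

CHUNK = 64 * 1024  # 64 KB, matching the example above


def stream_file_reusing_buffer(src, write):
    """Copy a binary stream to write() through one preallocated buffer.

    src   -- any binary file object that supports readinto()
    write -- hypothetical callable standing in for response.write
    Returns the total number of bytes forwarded.
    """
    buf = bytearray(CHUNK)     # allocated once, reused on every pass
    view = memoryview(buf)     # slicing a memoryview does not copy
    total = 0
    while True:
        n = src.readinto(buf)  # fill the buffer in place; 0 means EOF
        if not n:
            break
        write(view[:n])        # pass only the bytes actually read
        total += n
    return total


# Usage with in-memory stand-ins for the file and the response.
source = io.BytesIO(b"x" * (3 * CHUNK + 100))
sink = io.BytesIO()
sent = stream_file_reusing_buffer(source, sink.write)
```

This only works if `write` consumes or copies the bytes before it returns, because the next `readinto()` overwrites the same buffer. If the response object queues chunks for later sending, the simpler `f.read(CHUNK)` loop is the safer choice.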
Most web frameworks expose this as a generator or a file-stream response. The key is that you never build a single object holding all the bytes.
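As a rough illustration of the generator form, the sketch below yields a file one chunk at a time. The name `iter_file_chunks` is hypothetical, and how the generator is handed to a response differs between frameworks, so that step is left out.

```python
import os
import tempfile

CHUNK = 64 * 1024  # 64 KB per chunk


def iter_file_chunks(path, chunk_size=CHUNK):
    """Yield the file at `path` as bytes objects of at most chunk_size bytes."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                return     # leaving the with-block closes the file
            yield chunk


# Usage: write a small temporary file, then consume it chunk by chunk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a" * (2 * CHUNK + 5))
    demo_path = tmp.name

sizes = [len(c) for c in iter_file_chunks(demo_path)]
os.remove(demo_path)
print(sizes)  # [65536, 65536, 5]
```

Because the consumer pulls one chunk at a time, only the chunk currently in flight is held in memory, provided the consumer does not accumulate the chunks itself.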
| Signal | Load all | Stream |
|---|---|---|
| Peak memory per request | O(file size) | O(chunk size) |
| Memory under N concurrent | N × file size | N × chunk size |
| Time to first byte | After full read | After first chunk |
| Code complexity | Lower | Slightly higher |
Watch for three pitfalls:

- Calling read() with no size argument: that is the load-all bug, just hidden in a one-liner.
- Not setting Content-Length (or using chunked transfer encoding): some clients hang waiting to know when the body ends.
- Leaving the file open when an error interrupts the loop: close it in a finally block or use a context manager.

Suppose 200 users each download a 1 GB video at once. Load-all needs roughly 200 GB of RAM held at the same time, which is impossible on a normal box, so the server crashes. Streaming with a 64 KB buffer needs about 200 × 64 KB ≈ 13 MB total. Same files, same users, but one design fits in memory and the other doesn't.
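The arithmetic in the worked example can be checked with a few lines. This is a back-of-the-envelope sketch that assumes binary units (1 KB = 1024 bytes) and counts only file data held in buffers, ignoring per-connection overhead. All variable names are illustrative.

```python
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3

users = 200
file_size = 1 * GB    # each user downloads a 1 GB video
chunk_size = 64 * KB  # streaming buffer per download

load_all_bytes = users * file_size    # every request holds its whole file
streaming_bytes = users * chunk_size  # every request holds one chunk

print(load_all_bytes / GB)   # 200.0
print(streaming_bytes / MB)  # 12.5
```

The streaming total is 13,107,200 bytes, which is the "about 13 MB" quoted above. That is roughly 16,000 times less than the load-all figure.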
You stream the file but collect every chunk into a list, then b"".join(chunks) at the end. Does peak memory improve?