Distributed Filesystem

How cloud providers store files larger than any single hard drive.

The idea

A 50 TB dataset will not fit on a standard server's hard drive. A distributed filesystem such as HDFS solves this by chopping large files into chunks (e.g., 128 MB each) and scattering them across thousands of cheap, standard servers (Data Nodes). Object stores such as Amazon S3 tackle the same problem, although they expose objects rather than a filesystem. A central "Name Node" acts like a directory, remembering which chunks are on which machines. To ensure data isn't lost when a hard drive inevitably dies, every chunk is replicated to 3 different servers (the HDFS default).

Step 1: The user uploads a 300 MB file. With 128 MB chunks, the system chops it into three chunks: 128 MB, 128 MB, and 44 MB.
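
The chunking and replication step can be sketched in a few lines of Python. This is a minimal illustration, not how HDFS is implemented: the function names and the round-robin placement policy are assumptions made for the example, and real systems also weigh racks, free disk space and load when placing replicas.

```python
MB = 1024 * 1024
CHUNK_SIZE = 128 * MB  # example chunk size from the text
REPLICATION = 3        # each chunk is stored on 3 servers


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return the size in bytes of each chunk for a file of file_size bytes."""
    full, remainder = divmod(file_size, chunk_size)
    return [chunk_size] * full + ([remainder] if remainder else [])


def place_replicas(num_chunks, servers, replication=REPLICATION):
    """Assign every chunk to `replication` distinct servers.

    Hypothetical round-robin policy, chosen only to keep the sketch short.
    """
    if replication > len(servers):
        raise ValueError("not enough servers for the requested replication")
    return {
        f"chunk_{i + 1}": [servers[(i + r) % len(servers)] for r in range(replication)]
        for i in range(num_chunks)
    }


sizes = split_into_chunks(300 * MB)
print([s // MB for s in sizes])  # [128, 128, 44]

servers = ["Server_A", "Server_B", "Server_C", "Server_D", "Server_E", "Server_F"]
placement = place_replicas(len(sizes), servers)
print(placement["chunk_1"])  # ['Server_A', 'Server_B', 'Server_C']
```

Note that the last chunk is smaller than the others: a chunk only occupies as much disk as the data it actually holds.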

How it works (The Name Node)

The client application does not contact the Data Nodes first. It asks the central Name Node: "Where can I find the chunks for movie.mp4?" The Name Node replies with the addresses of the servers that hold each chunk. The client then connects to those servers directly and downloads the chunks in parallel, so the file data itself never passes through the Name Node.
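
The read path can be sketched as a two-step lookup. The example below is a toy, in-memory simulation: NAME_NODE, DATA_NODES, CHUNK_DATA and all function names are hypothetical stand-ins for the example, not the API of HDFS or any real client library.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster state: tiny byte strings play the role of 128 MB chunks.
CHUNK_DATA = {"chunk_1": b"dist", "chunk_2": b"ribu", "chunk_3": b"ted"}
NAME_NODE = {
    "movie.mp4": {
        "chunk_1": ["Server_A", "Server_B", "Server_E"],
        "chunk_2": ["Server_C", "Server_A", "Server_F"],
        "chunk_3": ["Server_D", "Server_B", "Server_C"],
    }
}
# Each Data Node stores only the chunks the Name Node assigned to it.
DATA_NODES = {}
for chunk_id, replicas in NAME_NODE["movie.mp4"].items():
    for server in replicas:
        DATA_NODES.setdefault(server, {})[chunk_id] = CHUNK_DATA[chunk_id]


def locate_chunks(filename):
    """Step 1: ask the Name Node where each chunk lives (metadata only)."""
    return list(NAME_NODE[filename].items())


def fetch_chunk(chunk_id, replicas):
    """Step 2: read one chunk from the first replica that still has it."""
    for server in replicas:
        chunk = DATA_NODES.get(server, {}).get(chunk_id)
        if chunk is not None:
            return chunk
    raise IOError(f"no live replica for {chunk_id}")


def read_file(filename):
    """Fetch all chunks in parallel and reassemble them in order."""
    locations = locate_chunks(filename)
    with ThreadPoolExecutor(max_workers=max(1, len(locations))) as pool:
        # pool.map yields results in the order of `locations`.
        return b"".join(pool.map(lambda loc: fetch_chunk(*loc), locations))


print(read_file("movie.mp4"))  # b'distributed'
```

Because every chunk has several replicas, the read still succeeds in this sketch if one server disappears from DATA_NODES: the client simply falls through to the next address in the list.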

# Illustrative view of the Name Node's in-memory map:
# file name -> chunk -> the 3 servers holding a replica of that chunk.
name_node_map = {
    "movie.mp4": {
        "chunk_1": ["Server_A", "Server_B", "Server_E"],
        "chunk_2": ["Server_C", "Server_A", "Server_F"],
        "chunk_3": ["Server_D", "Server_B", "Server_C"],
    }
}

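# The Name Node tracks where every chunk is stored. It also monitors
# heartbeats from the servers and, when one dies, has its chunks
# re-copied to healthy servers so that each chunk keeps 3 replicas.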
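
Heartbeat monitoring and repair can be sketched as follows. This is a simplified illustration under stated assumptions: the 30-second timeout, the function names and the "pick the first healthy server" policy are all hypothetical, and the sketch only updates the metadata. A real system would also copy the chunk bytes from a surviving replica to the new server.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; hypothetical value for the example
REPLICATION = 3


def find_dead_servers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Servers whose most recent heartbeat is older than `timeout` seconds."""
    return {s for s, seen in last_heartbeat.items() if now - seen > timeout}


def re_replicate(chunk_map, dead, live_servers, replication=REPLICATION):
    """Return a new chunk map without dead servers, topped up to `replication`."""
    repaired = {}
    for chunk_id, servers in chunk_map.items():
        survivors = [s for s in servers if s not in dead]
        candidates = [s for s in live_servers if s not in survivors]
        needed = max(0, replication - len(survivors))
        repaired[chunk_id] = survivors + candidates[:needed]
    return repaired


chunk_map = {
    "chunk_1": ["Server_A", "Server_B", "Server_E"],
    "chunk_2": ["Server_C", "Server_A", "Server_F"],
    "chunk_3": ["Server_D", "Server_B", "Server_C"],
}
# Time of the last heartbeat received from each server, in seconds.
last_heartbeat = {
    "Server_A": 100.0, "Server_B": 40.0, "Server_C": 99.0,
    "Server_D": 98.0, "Server_E": 97.0, "Server_F": 100.0,
}

dead = find_dead_servers(last_heartbeat, now=100.0)
live = sorted(set(last_heartbeat) - dead)
repaired = re_replicate(chunk_map, dead, live)

print(dead)                 # {'Server_B'}
print(repaired["chunk_1"])  # ['Server_A', 'Server_E', 'Server_C']
```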

Cost

Because every chunk is stored 3 times, storing 1 PB of data requires 3 PB of raw disk capacity. A distributed filesystem is also a poor fit for huge numbers of tiny files (for example, 1 KB each), because the Name Node keeps the entire directory map in RAM and every file and chunk needs its own entry, no matter how small it is. A billion tiny files will exhaust the Name Node's memory even if the Data Nodes have plenty of disk space left, much like inode exhaustion on a local filesystem.
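
A back-of-the-envelope calculation makes both costs concrete. The figure of roughly 150 bytes of Name Node memory per file or chunk entry is a commonly quoted rule of thumb for HDFS; treat it, along with the function names below, as an assumption of this sketch rather than a measured value.

```python
REPLICATION = 3
CHUNK_SIZE = 128 * 1024**2  # 128 MB
BYTES_PER_OBJECT = 150      # assumption: rough memory cost per file or chunk entry

PB = 10**15
GB = 10**9


def raw_storage_needed(logical_bytes, replication=REPLICATION):
    """Raw disk capacity needed to hold `logical_bytes` of user data."""
    return logical_bytes * replication


def name_node_memory(num_files, file_size, chunk_size=CHUNK_SIZE,
                     bytes_per_object=BYTES_PER_OBJECT):
    """Estimated Name Node RAM for `num_files` files of `file_size` bytes each."""
    chunks_per_file = max(1, -(-file_size // chunk_size))  # ceiling division
    objects = num_files * (1 + chunks_per_file)            # 1 file entry + its chunks
    return objects * bytes_per_object


print(raw_storage_needed(1 * PB) / PB)  # 3.0

# One billion 1 KB files: about 1 TB of data, but 2 billion metadata entries.
tiny = name_node_memory(10**9, 1024)
print(tiny / GB)  # 300.0

# Roughly the same 1 TB packed into 7,630 files of 128 MB each.
big = name_node_memory(7630, CHUNK_SIZE)
print(big)  # 2289000
```

Under these assumptions, the same terabyte of data costs the Name Node about 300 GB of RAM as tiny files and only a few megabytes as large ones, which is why small files are usually packed into larger containers before being stored.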

Watch out for

Latency vs Throughput: Distributed filesystems are designed for massive throughput (streaming 100 GB of data quickly by downloading from 20 servers at once). They are poor at low-latency access: every read involves a metadata lookup and at least one network round trip, so seeking to a random byte in the middle of a file might take around 50 ms, compared with roughly 1 ms on a local disk. Don't run a relational database directly on top of HDFS!
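
The trade-off can be put into numbers with a short calculation. The 100 MB/s per-server read speed and the count of 10,000 random reads are illustrative assumptions. The 50 ms and 1 ms latencies are the example figures from the text.

```python
GB = 10**9
MB = 10**6
PER_SERVER_BYTES_PER_SEC = 100 * MB  # assumption: sustained read speed of one server


def streaming_seconds(total_bytes, servers, per_server=PER_SERVER_BYTES_PER_SEC):
    """Time to stream `total_bytes` when reading from `servers` machines at once."""
    return total_bytes / (servers * per_server)


def random_read_seconds(num_reads, latency_ms):
    """Time for `num_reads` dependent random reads issued one after another."""
    return num_reads * latency_ms / 1000


# Throughput: parallelism divides the streaming time.
print(streaming_seconds(100 * GB, servers=1))   # 1000.0
print(streaming_seconds(100 * GB, servers=20))  # 50.0

# Latency: parallelism does not help a chain of small dependent reads.
print(random_read_seconds(10_000, latency_ms=50))  # 500.0
print(random_read_seconds(10_000, latency_ms=1))   # 10.0
```

A database index lookup is exactly this second pattern: many small reads where each one depends on the result of the previous one, so the per-read latency dominates.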