How cloud providers store files larger than any single hard drive.
A 50-terabyte dataset won't fit on a standard server's hard drive. A distributed filesystem (like HDFS, or the Google File System that inspired it) solves this by chopping large files into chunks (e.g., 128 MB each) and scattering them across thousands of cheap, standard servers (Data Nodes). A central "Name Node" acts like a directory, remembering which chunks are on which machines. To ensure data isn't lost when a hard drive inevitably dies, every chunk is replicated to several different servers, typically 3.
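The chunking and replication described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular filesystem does it: the function names and the round-robin placement are assumptions made for the example, and real systems use more elaborate placement policies (HDFS, for instance, takes rack layout into account).

```python
CHUNK_SIZE = 128 * 1024**2  # 128 MiB, the example chunk size from the text
REPLICATION = 3             # copies kept of every chunk


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return an (offset, length) pair per chunk; the last chunk may be shorter."""
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


def place_replicas(num_chunks, servers, replication=REPLICATION):
    """Assign each chunk to `replication` distinct servers (simple round-robin)."""
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for i in range(num_chunks):
        # Consecutive positions modulo len(servers) are always distinct servers.
        placement[f"chunk_{i + 1}"] = [
            servers[(i + r) % len(servers)] for r in range(replication)
        ]
    return placement


# A 50 TiB file splits into 409,600 chunks of 128 MiB.
print(len(split_into_chunks(50 * 1024**4)))

# Hypothetical server names, matching the style used in this article.
servers = ["Server_A", "Server_B", "Server_C", "Server_D", "Server_E", "Server_F"]
print(place_replicas(3, servers))
```

Keeping the replicas of a chunk on distinct servers is the point of the exercise: two copies on the same machine would be lost together when that machine fails.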
The client application does not start by contacting the Data Nodes. It first asks the central Name Node: "Where can I find the chunks for movie.mp4?" The Name Node replies with the addresses of the servers holding each chunk. The client then connects to those servers directly and downloads the chunks in parallel, so the file data itself never passes through the Name Node.
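The lookup-then-fetch flow can be sketched as follows. Everything here is an in-memory stand-in: `NAME_NODE`, `DATA_NODES`, `locate_chunks`, `fetch_chunk`, and `read_file` are hypothetical names invented for this example, and a real client would make network calls instead of dictionary lookups.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical Name Node metadata: file -> chunk -> servers holding a replica.
NAME_NODE = {
    "movie.mp4": {
        "chunk_1": ["Server_A", "Server_B", "Server_E"],
        "chunk_2": ["Server_C", "Server_A", "Server_F"],
        "chunk_3": ["Server_D", "Server_B", "Server_C"],
    }
}

# Hypothetical Data Nodes, populated so that they agree with the map above.
CHUNK_DATA = {"chunk_1": b"first-", "chunk_2": b"second-", "chunk_3": b"third"}
DATA_NODES = {}
for chunk_id, chunk_servers in NAME_NODE["movie.mp4"].items():
    for server in chunk_servers:
        DATA_NODES.setdefault(server, {})[chunk_id] = CHUNK_DATA[chunk_id]


def locate_chunks(filename):
    """Step 1: ask the Name Node where the chunks live (metadata only)."""
    return NAME_NODE[filename]


def fetch_chunk(chunk_id, replica_servers):
    """Step 2: read one chunk from the first replica that is reachable."""
    for server in replica_servers:
        node = DATA_NODES.get(server)
        if node is not None and chunk_id in node:
            return node[chunk_id]
    raise IOError(f"no live replica for {chunk_id}")


def read_file(filename):
    """Fetch all chunks in parallel and reassemble them in file order."""
    locations = locate_chunks(filename)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() yields results in input order, so the chunks stay in sequence.
        parts = pool.map(lambda item: fetch_chunk(*item), locations.items())
        return b"".join(parts)


print(read_file("movie.mp4"))
```

Because each chunk has several replicas, the read still succeeds if one server disappears: `fetch_chunk` simply falls through to the next address in the list.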
# The Name Node's internal memory looks like this:
{
    "movie.mp4": {
        "chunk_1": ["Server_A", "Server_B", "Server_E"],
        "chunk_2": ["Server_C", "Server_A", "Server_F"],
        "chunk_3": ["Server_D", "Server_B", "Server_C"]
    }
}
# The Name Node tracks where everything is,
# and monitors heartbeats to replace dead servers.
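The heartbeat monitoring mentioned in the comment above might look roughly like this. It is a sketch under stated assumptions: the 30-second timeout is an illustrative value rather than any system's default, the function names are made up for the example, and only the metadata is updated here, whereas a real cluster would also copy the chunk bytes from a surviving replica to the new server.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative value, not a real default


def find_dead_servers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the servers whose last heartbeat is older than the timeout."""
    return {s for s, seen in last_heartbeat.items() if now - seen > timeout}


def re_replicate(chunk_map, dead, live_servers, replication=3):
    """Return a new chunk map in which dead replicas are replaced by live servers."""
    repaired = {}
    for chunk_id, servers in chunk_map.items():
        survivors = [s for s in servers if s not in dead]
        # Only servers that do not already hold this chunk are candidates.
        candidates = [s for s in live_servers if s not in survivors]
        missing = max(0, replication - len(survivors))
        repaired[chunk_id] = survivors + candidates[:missing]
    return repaired


chunk_map = {
    "chunk_1": ["Server_A", "Server_B", "Server_E"],
    "chunk_2": ["Server_C", "Server_A", "Server_F"],
    "chunk_3": ["Server_D", "Server_B", "Server_C"],
}
# Server_B last reported at t=50; every other server reported at t=100.
last_heartbeat = {name: 100.0 for name in
                  ["Server_A", "Server_C", "Server_D", "Server_E", "Server_F"]}
last_heartbeat["Server_B"] = 50.0

dead = find_dead_servers(last_heartbeat, now=100.0)
live = sorted(set(last_heartbeat) - dead)
print(dead)
print(re_replicate(chunk_map, dead, live))
```

In this run only `chunk_1` and `chunk_3` change, since they were the only chunks with a replica on the failed server.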
Because every chunk is stored 3 times, storing 1 PB of data physically requires 3 PB of hard drives. This design is also a poor fit for millions of tiny 1 KB files: the Name Node keeps the entire directory map in RAM, and every file costs an entry in that map no matter how small the file is. A billion tiny files can exhaust the Name Node's memory even while the Data Nodes have plenty of free disk space, much like inode exhaustion on a local filesystem.
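A quick back-of-the-envelope calculation makes both costs concrete. The figure of 150 bytes of Name Node memory per file or chunk entry is an assumption for this sketch (a commonly quoted rule of thumb for HDFS, not a measured value), and the function names are invented for the example.

```python
REPLICATION = 3
BYTES_PER_OBJECT = 150  # ASSUMED rule-of-thumb Name Node memory per file or chunk entry


def raw_storage_needed(logical_bytes, replication=REPLICATION):
    """Physical disk consumed once every chunk is stored `replication` times."""
    return logical_bytes * replication


def name_node_memory(num_files, chunks_per_file=1, bytes_per_object=BYTES_PER_OBJECT):
    """Rough Name Node RAM: one entry per file plus one entry per chunk."""
    return num_files * (1 + chunks_per_file) * bytes_per_object


PB = 10**15
GB = 10**9

# 1 PB of data occupies 3 PB of raw disk.
print(raw_storage_needed(1 * PB) // PB, "PB of disk")

# A billion 1 KB files hold only about 1 TB of data, yet under the assumed
# per-entry cost they need roughly 300 GB of Name Node RAM.
print(name_node_memory(10**9) // GB, "GB of RAM")
```

The memory cost depends on the number of entries rather than on the number of bytes stored, which is why packing small records into a few large files is the usual workaround.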