Batch Data Processing

Chewing through terabytes of logs using Map and Reduce.

The idea

If you have a 10 GB file of web logs and you want to count how many times each IP address visited, you could write a Python loop. But what if you have 10 Terabytes of logs? A single computer would take weeks! Distributed Data Processing (like Hadoop or Apache Spark) solves this by splitting the giant file into chunks, sending the chunks to hundreds of worker computers (Map), and then combining their answers together (Reduce).

Step 1: We have 3 chunks of log data. We want to count the occurrences of "A" and "B".

How it works (Map, Shuffle, Reduce)

The magic is the Shuffle phase. The Mapper workers parse the raw text and output (Key, Value) pairs. The framework then "shuffles" the network, guaranteeing that all pairs with the exact same Key end up on the exact same Reducer worker. The Reducer simply loops over the list and sums them up.

# Pseudo-code for Word Count in MapReduce

# 1. MAP PHASE (Runs in parallel on many machines)
# Input: "A B A"
def map(text_chunk):
    for word in text_chunk:
        emit(word, 1)  # Outputs: (A,1), (B,1), (A,1)

# -- SHUFFLE PHASE happens here automatically --
# The framework groups by key: A -> [1, 1], B -> [1]

# 2. REDUCE PHASE (Runs in parallel on many machines)
# Input: ("A", [1, 1])
def reduce(key, values):
    total = sum(values)
    emit(key, total)   # Outputs: (A, 2)

Cost

The Shuffle phase is extremely expensive because it requires massive network I/O to move data between the machines. The network becomes the bottleneck. Modern tools like Apache Spark optimize this by keeping as much intermediate data in RAM as possible, avoiding writing to disk between every step.

Watch out for

Data Skew: What if 99% of your web traffic comes from a single IP address (like Googlebot)? The Shuffle phase will send 99% of the data to exactly one Reducer machine! That machine will run out of memory and crash, while all the other Reducers sit idle. You must "salt" your keys or use Combiners to fix Data Skew.