Late Data (Streaming ML)

A user's phone goes into a tunnel, and its events arrive late and out of order.

The idea

In streaming ML pipelines (e.g., calculating "Number of clicks in the last 5 minutes" to feed an ad recommender), data is processed continuously in Time Windows. However, physical reality is messy. A user might click an ad while on a train going through a tunnel, where their phone has no signal. The "Click" event doesn't reach your servers until the train exits the tunnel 10 minutes later, so the event's Event Time (when it actually happened) differs from its Processing Time (when your server received it). This is Late Data. If windows are closed on Processing Time alone, the click is either counted in the wrong window or missed entirely.

Step 1: A Time Window is open to calculate clicks between 12:00 and 12:05.
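
To make the difference concrete, here is a minimal plain-Python sketch (not Spark code) that assigns the same three clicks to 5-minute windows twice: once by Event Time and once by Processing Time. The date, the click timestamps, and the helper names `at` and `window_start` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1)


def at(hour: int, minute: int) -> datetime:
    """Build a timestamp on a hypothetical day."""
    return datetime(2024, 1, 1, hour, minute)


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 5-minute tumbling window."""
    return ts - ((ts - EPOCH) % WINDOW)


# Each click is (event_time, processing_time).
clicks = [
    (at(12, 1), at(12, 1)),   # arrived immediately
    (at(12, 3), at(12, 3)),   # arrived immediately
    (at(12, 4), at(12, 14)),  # clicked in the tunnel, arrived 10 minutes later
]

by_event_time = Counter(window_start(event) for event, _ in clicks)
by_processing_time = Counter(window_start(proc) for _, proc in clicks)

print("12:00 window by event time:     ", by_event_time[at(12, 0)])
print("12:00 window by processing time:", by_processing_time[at(12, 0)])
```

Grouped by Event Time, all three clicks land in the 12:00-12:05 window. Grouped by Processing Time, the tunnel click lands in the 12:10-12:15 window instead.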

How it works (Watermarks)

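To handle Late Data, stream processors like Apache Flink or Spark Structured Streaming use a concept called Watermarks. A Watermark is the engine's running estimate of how far Event Time has progressed. In Spark, it is the latest Event Time seen so far minus an allowed delay that you configure (here, 10 minutes). In effect, this gives each window a grace period. Instead of finalizing the 12:00-12:05 window as soon as 12:05 is reached, the system keeps the window's partial count in state until the Watermark passes 12:05, that is, until it has seen events stamped later than 12:15. Late events from tunnels can still sneak in and be counted properly before the final total is sent to the ML model.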

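# Spark Structured Streaming example (PySpark)
# Assumes clicks_df is a streaming DataFrame with a timestamp column
# "event_timestamp" and a "user_id" column.
from pyspark.sql.functions import window

# 1. withWatermark: tolerate events up to 10 minutes behind the latest EVENT time seen
# 2. groupBy(window(...)): group events into 5-minute windows based on EVENT time
windowed_counts = (
    clicks_df
    .withWatermark("event_timestamp", "10 minutes")
    .groupBy(
        window("event_timestamp", "5 minutes"),
        "user_id",
    )
    .count()
)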

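# Events more than 10 minutes behind the latest event time seen are no longer
# guaranteed to be counted; Spark may drop them. For example, once an event
# stamped 12:16 has been seen, a late click stamped 12:04 can be dropped.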
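
The dropping rule can be traced by hand with a small plain-Python simulation. This is a simplified sketch, not how Spark is implemented: it advances the watermark after every single event, whereas Spark updates it between micro-batches, so real behavior is coarser. The class `WatermarkedCounter` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
DELAY = timedelta(minutes=10)  # the watermark delay
EPOCH = datetime(1970, 1, 1)


def at(hour: int, minute: int) -> datetime:
    """Build a timestamp on a hypothetical day."""
    return datetime(2024, 1, 1, hour, minute)


class WatermarkedCounter:
    """Counts events per 5-minute event-time window under a watermark."""

    def __init__(self):
        self.max_event_time = None
        self.open_windows = {}  # window start -> partial count (state)
        self.final = {}         # window start -> finalized count
        self.dropped = 0

    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - DELAY

    def add(self, event_time: datetime) -> None:
        start = event_time - ((event_time - EPOCH) % WINDOW)
        wm = self.watermark()
        # Too late: the event's window already ended at or before the watermark.
        if wm is not None and start + WINDOW <= wm:
            self.dropped += 1
            return
        self.open_windows[start] = self.open_windows.get(start, 0) + 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Finalize every window the watermark has now passed, freeing its state.
        wm = self.watermark()
        for s in [s for s in self.open_windows if s + WINDOW <= wm]:
            self.final[s] = self.open_windows.pop(s)


counter = WatermarkedCounter()
# Events in ARRIVAL order; 12:04 and 12:02 arrive late.
for event_time in [at(12, 1), at(12, 3), at(12, 9), at(12, 4), at(12, 16), at(12, 2)]:
    counter.add(event_time)

print("finalized:", counter.final)
print("dropped:", counter.dropped)
```

In this trace the 12:04 click arrives while the watermark is still at 11:59, so it is counted. The 12:16 event moves the watermark to 12:06, which finalizes the 12:00-12:05 window with a count of 3. The 12:02 click that arrives afterwards is dropped.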

Cost

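A longer Watermark means more state (memory). While a window is waiting for the Watermark to pass, the streaming engine has to hold its partial calculation in the state store, for every key. With a 5-minute window and a 10-minute Watermark, that is about three windows' worth of counts per user at any moment. If you set a Watermark of 24 hours (just in case a phone was off all day), it is hundreds of windows per user, and the state can outgrow the memory of your streaming cluster. Final totals also arrive later, because a window's result is only final once the Watermark has passed it. It is a trade-off between Data Accuracy on one side and memory usage and latency on the other.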
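
A back-of-the-envelope estimate of this cost, as a rough sketch: for tumbling windows, at most ceil((window + watermark) / window) windows per key are open at once. The function names and the figure of one million active users are hypothetical, and the result is an upper bound that assumes every user is active in every window.

```python
import math
from datetime import timedelta


def max_open_windows(window: timedelta, delay: timedelta) -> int:
    """Upper bound on tumbling windows per key that are not yet finalized."""
    return math.ceil((window + delay) / window)


def estimated_state_entries(window: timedelta, delay: timedelta, active_keys: int) -> int:
    """Worst case: every key has a partial count in every open window."""
    return max_open_windows(window, delay) * active_keys


window = timedelta(minutes=5)
users = 1_000_000  # hypothetical number of active users

for delay in (timedelta(minutes=10), timedelta(hours=24)):
    per_key = max_open_windows(window, delay)
    total = estimated_state_entries(window, delay, users)
    print(f"watermark={delay}: {per_key} windows per user, up to {total:,} state entries")
```

Under these assumptions, moving from a 10-minute to a 24-hour Watermark raises the bound from 3 to 289 open windows per user.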

Watch out for

Online/Offline Skew: Late Data causes a mismatch between training and serving data. Your real-time streaming pipeline might have missed the late click (serving a feature value of 0), but your nightly batch job (which has all the data) sees the click and records 1. When you train a model on the historical data, it learns from feature values that your real-time pipeline never actually produces.
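
The skew can be reproduced in a few lines of plain Python. The sketch below also shows one possible mitigation, which is an assumption and not something the pipeline above does by itself: if the batch job logs each event's arrival time, it can rebuild the feature "as of" the moment it was served. The function `clicks_in_window` and all timestamps are hypothetical.

```python
from datetime import datetime


def at(hour: int, minute: int) -> datetime:
    """Build a timestamp on a hypothetical day."""
    return datetime(2024, 1, 1, hour, minute)


# Each click is (event_time, arrival_time). Assumes arrival time is logged.
clicks = [(at(12, 4), at(12, 14))]  # the tunnel click


def clicks_in_window(clicks, start, end, as_of=None):
    """Count clicks with event time in [start, end).

    If as_of is given, count only clicks that had already arrived by then.
    """
    return sum(
        1
        for event_time, arrival_time in clicks
        if start <= event_time < end and (as_of is None or arrival_time <= as_of)
    )


served_at = at(12, 6)  # when the model asked for the feature

# What the streaming pipeline could see at serving time.
online_value = clicks_in_window(clicks, at(12, 0), at(12, 5), as_of=served_at)
# What a nightly batch job computes if it ignores arrival time.
naive_batch_value = clicks_in_window(clicks, at(12, 0), at(12, 5))
# Batch job that rebuilds the feature as of serving time.
point_in_time_value = clicks_in_window(clicks, at(12, 0), at(12, 5), as_of=served_at)

print(online_value, naive_batch_value, point_in_time_value)
```

Here the naive batch value (1) disagrees with what was served online (0), while the as-of value matches it.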