Flow telemetry collector protocol

The router never mails you the packets — it mails you a tally. You learn who talked to whom and how much, by best-effort post that sometimes goes missing.

The idea

A router or switch watches packets fly by and folds them into flows — groups keyed by the 5-tuple (src IP, dst IP, src port, dst port, protocol). For each flow it keeps a row in a flow cache, ticking up packet and byte counters as matching packets arrive.

When a flow expires (idle too long, open too long, a TCP FIN/RST, or the cache fills), the device exports one compact record over UDP to a collector. So what you receive is a stream of summaries, not a full packet capture. That makes it cheap and scalable, but it also means that under sampling the counts are estimates, export datagrams can be lost, and (in IPFIX) you can't decode a record until its template has arrived.


How it works

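The observation point hashes each sampled packet's 5-tuple to find or create a cache row, then adds 1 to packets and the packet length to bytes. A flow leaves the cache, and is exported, under any of four conditions:

Inactive timeout: no matching packet has arrived for too long.
Active timeout: the flow has been open too long, so it is exported while still in progress.
TCP FIN or RST: the connection is closing, so the row is flushed at once.
Cache full: the device needs room for new flows.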

Export goes out over UDP — fire-and-forget, no acks, no retransmit. In IPFIX the record is just field values in a packed binary layout; the template that names and types those fields is sent separately and periodically. A collector that hasn't yet seen the matching template literally cannot decode the data record.
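A toy collector makes the template dependency concrete. This is a minimal sketch, not the IPFIX wire format: the field names, the struct layout, template ID 256, and the helper names on_template, on_data, and decode are all assumptions chosen for illustration.

```python
import struct

templates = {}   # template_id -> list of (field_name, struct format char)
pending = []     # raw data records that arrived before their template

def decode(template_id, payload):
    """Unpack a record using its template; the payload carries no names or types."""
    fields = templates[template_id]
    fmt = "!" + "".join(code for _, code in fields)   # network byte order, no padding
    values = struct.unpack(fmt, payload)
    return dict(zip((name for name, _ in fields), values))

def on_template(template_id, fields):
    """Remember a template, then decode any records that were waiting on it."""
    templates[template_id] = fields
    ready = [r for r in pending if r[0] == template_id]
    pending[:] = [r for r in pending if r[0] != template_id]
    return [decode(tid, payload) for tid, payload in ready]

def on_data(template_id, payload):
    """Decode if possible; otherwise park the raw bytes until the template shows up."""
    if template_id not in templates:
        pending.append((template_id, payload))
        return None
    return decode(template_id, payload)

# Hypothetical template: two ports, a protocol byte, and two 64-bit counters.
FIELDS = [("src_port", "H"), ("dst_port", "H"), ("proto", "B"),
          ("packets", "Q"), ("bytes", "Q")]
record = struct.pack("!HHBQQ", 443, 51020, 6, 1100, 1_500_000)

early = on_data(256, record)      # template not seen yet, so nothing decodes
late = on_template(256, FIELDS)   # template arrives, the parked record decodes
print(early, late)
```

Parking undecodable records is only one possible policy; a collector could equally drop them, which is simpler but loses everything sent before the first template arrives.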

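# Flow cache sketch. Timeouts, cache size, and collector address are illustrative.
import random
import socket
import time

INACTIVE, ACTIVE = 15, 60      # idle / active timeouts, seconds
MAX_FLOWS = 4096               # cache capacity
sampling_rate = 1              # N in 1:N sampling
collector = ("127.0.0.1", 4739)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
now = time.monotonic
cache = {}                     # 5-tuple  -> flow row

def sampled_out(pkt):                      # True when the 1:N sampler skips pkt
    return random.randrange(sampling_rate) != 0

def on_packet(pkt):
    if sampled_out(pkt): return            # 1:N sampling drops the rest
    key = (pkt.src_ip, pkt.dst_ip,
           pkt.src_port, pkt.dst_port, pkt.proto)
    f = cache.get(key)
    if f is None:
        if len(cache) >= MAX_FLOWS:        # cache full -> evict the idlest row
            idlest = min(cache, key=lambda k: cache[k]["last"])
            export(idlest, cache.pop(idlest))
        f = cache[key] = {"packets": 0, "bytes": 0,
                          "start": now(), "last": now()}
    f["packets"] += 1
    f["bytes"]   += pkt.length
    f["last"]     = now()
    if pkt.tcp_fin or pkt.tcp_rst:
        export(key, cache.pop(key))        # FIN/RST -> flush now

def sweep():                               # runs on a timer
    for key, f in list(cache.items()):
        if now() - f["last"]  > INACTIVE: export(key, cache.pop(key))
        elif now() - f["start"] > ACTIVE:  export(key, cache.pop(key))

def encode(key, f):                        # stand-in for the real wire encoding
    return repr((key, f)).encode()

def export(key, f):
    f["packets"] *= sampling_rate          # scale back up to an estimate
    f["bytes"]   *= sampling_rate
    sock.sendto(encode(key, f), collector) # best-effort, may be lost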

Note the *= sampling_rate: if the device inspected only 1 in N packets, it multiplies the counts by N to estimate the true total. The estimate is unbiased (right on average across many flows) but noisy for small flows, which may be sampled once, several times, or not at all.
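A quick simulation shows both halves of that claim. This is a sketch under one assumption that real devices may not share: every packet is sampled independently with probability 1/N. The function name estimate_packets, the seed, and the flow sizes are made up for the example.

```python
import random

def estimate_packets(true_packets, n, rng):
    """Sample each packet with probability 1/n, then scale the count by n."""
    sampled = sum(1 for _ in range(true_packets) if rng.randrange(n) == 0)
    return sampled * n

rng = random.Random(7)   # fixed seed so runs are repeatable
N = 100                  # 1:100 sampling

big = [estimate_packets(10_000, N, rng) for _ in range(100)]
small = [estimate_packets(50, N, rng) for _ in range(100)]

print("big flow: true 10000, mean estimate", sum(big) / len(big))
print("small flow: true 50, estimates seen", sorted(set(small)))
print("small flow missed entirely in", small.count(0), "of 100 trials")
```

For the big flow the average estimate sits close to the true count. For the 50-packet flow every estimate is a multiple of 100, so no single record can report the true value, and many trials report nothing at all.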

Signals & trade-offs

Lever | Effect | Watch
Sampling 1:1 | Exact counts, every flow seen | Heavy CPU / cache load on the router
Sampling 1:1000 | Cheap, scales to backbone links | Counts are ×1000 estimates; small flows missed
Active timeout short | Fresher data, faster visibility | More export traffic, splits long flows
Active timeout long | Fewer, fatter records | Stale view of in-progress conversations
UDP export | Cheap, no per-record router state | Silent loss: no retransmit, gaps appear
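The "splits long flows" cost means one long conversation can arrive as several records, which the collector has to add back together. A minimal sketch, assuming each record has already been decoded into a hypothetical (key, packets, bytes) tuple; the function name stitch and the sample numbers are illustrative:

```python
from collections import defaultdict

def stitch(records):
    """Sum the partial records that share a 5-tuple into one total per flow."""
    totals = defaultdict(lambda: {"packets": 0, "bytes": 0, "records": 0})
    for key, packets, nbytes in records:
        row = totals[key]
        row["packets"] += packets
        row["bytes"] += nbytes
        row["records"] += 1
    return dict(totals)

key = ("10.0.0.7", "10.0.0.40", 443, 51020, 6)
# One long transfer, cut into three records by a short active timeout.
records = [(key, 400, 550_000), (key, 400, 550_000), (key, 300, 400_000)]
totals = stitch(records)
print(totals[key])
```

Grouping by 5-tuple alone is a simplification: a later connection that reuses the same addresses and ports would be merged in, so a real collector would likely also bound the grouping by time. If one of the three datagrams is lost, the sum silently comes up short.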

Watch out for

Worked example

A web server sends a 1.5 MB file to a client over one TCP connection, say 10.0.0.7:443 → 10.0.0.40:51020. At the edge router the first packet in that direction creates a flow row; each subsequent packet ticks packets and adds its length to bytes. At 1:1 sampling the row reaches roughly packets ≈ 1100, bytes ≈ 1,500,000. When the server's FIN passes the router, the row is flushed immediately and one IPFIX record leaves over UDP. (The client's FIN travels in the reverse direction, which has its own 5-tuple and its own row.) If that single datagram is lost in transit, the collector never learns about a 1.5 MB transfer, and with no retransmit it never will. Bump sampling to 1:1000 and the same flow is built from roughly one observed packet, then multiplied by 1000. If exactly one full-size packet of about 1,500 bytes is sampled, the record reads bytes ≈ 1,500,000; if two are sampled it reads about 3,000,000; and if none is sampled the flow is never reported at all.
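The 1:1000 case can be put in numbers with a binomial model. This is a sketch that assumes each of the roughly 1100 packets is sampled independently with probability 1/1000; devices that sample deterministically (every Nth packet) behave differently. The function name sample_count_probs is made up for the example.

```python
from math import comb

def sample_count_probs(packets, n, max_k=3):
    """P(exactly k of `packets` are sampled) for k = 0..max_k under 1:n sampling."""
    p = 1 / n
    return [comb(packets, k) * p**k * (1 - p)**(packets - k)
            for k in range(max_k + 1)]

probs = sample_count_probs(packets=1100, n=1000)
for k, pr in enumerate(probs):
    print(f"{k} sampled -> reported packets = {k * 1000}, probability {pr:.2f}")
```

Under this model the flow goes entirely unseen about one time in three, and seeing exactly one packet is only slightly more likely than that.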

Check yourself

Your collector starts receiving IPFIX data records right after a router reboot, but it can't decode them — every field reads as garbage. What is the most likely cause?