A relay forwards voice and video packets between callers in real time, fast enough that a dropped one is better skipped than waited for.
On a live call, audio and video are chopped into tiny RTP packets and sent over the network. A relay in the middle forwards them between the two callers — useful when the callers can't reach each other directly, or when one stream needs to fan out to many viewers.
The network is messy: packets arrive out of order, late, or not at all. The receiver keeps a small jitter buffer that briefly holds packets so it can re-sort them by sequence number before playing. But it never waits forever — for a live call, bounded latency beats perfect delivery. A late or lost packet is concealed and the audio moves on.
Every RTP packet carries a small header: a sequence number (increments by one per packet, so the receiver can detect gaps and reorder), a timestamp (the media sampling instant, so playout is paced correctly), and an SSRC (which stream this is). The relay typically forwards packets without re-encoding — a TURN relay copies bytes verbatim; an SFU selectively forwards each sender's stream to subscribers. Cheap and low-latency, because there's no transcoding.
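To make the header fields concrete, here is a minimal sketch that decodes the 12-byte fixed RTP header with Python's struct module. The names RtpHeader and parse_rtp_header are hypothetical, the sample packet is made up for illustration, and header extensions and CSRC lists are ignored.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    payload_type: int
    sequence: int    # 16-bit, increments by one per packet
    timestamp: int   # 32-bit media sampling instant
    ssrc: int        # 32-bit stream identifier


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed RTP header; extensions and CSRCs are ignored."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte fixed RTP header")
    # Network byte order: two flag bytes, 16-bit seq, 32-bit timestamp, 32-bit SSRC.
    first, second, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=first >> 6,            # top two bits of the first byte
        payload_type=second & 0x7F,    # low seven bits of the second byte
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


# Made-up packet: version 2, payload type 111, seq 4, timestamp 3840.
sample = struct.pack("!BBHII", 0x80, 111, 4, 3840, 0x1234ABCD) + b"payload"
header = parse_rtp_header(sample)
```

A relay that forwards without re-encoding only needs to read these fields, for example to route by SSRC; it does not have to touch the payload.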
The receiver runs a jitter buffer: incoming packets are inserted by sequence number and held just long enough to absorb network variance. A fixed playout clock pops the next expected sequence on each tick. If that sequence hasn't arrived by its deadline, the buffer conceals it (interpolates or skips) and advances — it never blocks the call waiting for one packet. Loss and timing are reported back over RTCP so the sender can adapt its bitrate.
on_packet(p):
    buffer.insert(p.seq, p)            # reorder by sequence number

on_tick(now):                          # fixed playout clock
    want = next_seq
    if want in buffer:
        play(buffer.pop(want)); next_seq += 1
    elif now > deadline(want):
        conceal(want); next_seq += 1   # don't wait past the deadline
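The pseudocode above treats sequence numbers as plain integers. RTP sequence numbers are 16 bits wide and wrap from 65535 back to 0, so a real buffer needs a wraparound-aware comparison. The following is one common way to sketch it (serial-number style arithmetic); the helper names seq_delta, seq_less, and advance are hypothetical.

```python
SEQ_MOD = 1 << 16  # RTP sequence numbers are 16-bit


def seq_delta(a: int, b: int) -> int:
    """Signed distance from b to a, in the range [-32768, 32767]."""
    half = SEQ_MOD // 2
    return ((a - b + half) % SEQ_MOD) - half


def seq_less(a: int, b: int) -> bool:
    """True if a comes before b, allowing for wraparound."""
    return seq_delta(a, b) < 0


def advance(seq: int) -> int:
    """Next sequence number after seq, wrapping at 65536."""
    return (seq + 1) % SEQ_MOD


# Packets that straddle the wrap, sorted relative to the oldest expected seq.
arrived = [1, 65534, 0, 65535]
ordered = sorted(arrived, key=lambda s: seq_delta(s, 65534))
```

With a plain < the packet numbered 0 would sort before 65535 and be played first; sorting by the signed distance keeps the playout order 65534, 65535, 0, 1.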
| Choice | Cost | Note |
|---|---|---|
| Deeper jitter buffer | + latency | Tolerates more reorder & loss; too deep and the call feels laggy |
| Shallow jitter buffer | + loss / glitches | Low latency, but late packets miss their deadline and get concealed |
| Relay (TURN / SFU) | + latency, + bandwidth cost | Works behind any NAT; one stream can fan out to many subscribers |
| Direct P2P | NAT traversal can fail | Lowest latency when it connects; no server media cost |
| UDP / RTP transport | Must handle loss yourself | No head-of-line blocking — a lost packet never stalls the rest |
| TCP transport | Head-of-line blocking | Retransmits stall everything behind the loss — wrong for live media |
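The last two rows can be illustrated with a toy model, not a real network stack. It compares when each packet becomes usable under in-order delivery (TCP-like) and under independent datagrams (UDP-like). The arrival times are made-up numbers in milliseconds, and in_order_delivery is a hypothetical helper.

```python
def in_order_delivery(arrivals: dict[int, float]) -> dict[int, float]:
    """TCP-like: a packet is handed up only after every earlier one has been."""
    delivered = {}
    ready_at = 0.0
    for seq in sorted(arrivals):
        ready_at = max(ready_at, arrivals[seq])
        delivered[seq] = ready_at
    return delivered


# Made-up arrival times in ms. Packet 6 was lost; its retransmission lands at 400.
arrivals = {5: 100.0, 6: 400.0, 7: 140.0, 8: 160.0}

tcp_like = in_order_delivery(arrivals)

# UDP-like: no retransmission, so 6 is simply missing (and gets concealed),
# while every other datagram is usable the moment it arrives.
udp_like = {seq: t for seq, t in arrivals.items() if seq != 6}
```

In this model packets 7 and 8 reach the player at 400 ms under in-order delivery, even though they arrived at 140 ms and 160 ms. With independent datagrams they are available on arrival and only packet 6 is lost.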
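RTP sequence numbers are 16 bits and wrap around, so compare them with wraparound-aware arithmetic rather than a plain <, or reordering breaks at the wrap.
Sender A emits packets 1..7, each forwarded by the relay. The network reorders packet 4 so it arrives after 5, and drops packet 6 entirely. Watch what the receiver's jitter buffer does: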
arrive: 1 2 3 5 4 7 (6 never shows up)
buffer: holds out-of-order packets, sorted by seq
playout (fixed clock, next_seq advancing):
1 -> play 2 -> play 3 -> play
4 -> in buffer (it arrived late) -> play 5 -> play
6 -> not here, deadline passed -> conceal, skip
7 -> play
result: 1 2 3 4 5 (6 concealed) 7 with bounded delay
Packet 4 was late but still beat its deadline, so the buffer reordered it back ahead of 5 and played it in order. Packet 6 missed its deadline, so instead of stalling the whole call, the receiver concealed it and moved straight on to 7. Smooth audio, one tiny gap, no freeze.
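The walkthrough can be replayed with a small simulation. This is a sketch under stated assumptions rather than a production jitter buffer: time is counted in abstract ticks, the arrival ticks are invented to match the arrival order above, packet n is due for playout at tick n + depth, and sequence numbers are small enough that wraparound is ignored. The names simulate and ARRIVALS are hypothetical.

```python
def simulate(arrivals, first_seq, last_seq, depth):
    """Replay (tick, seq) arrivals through a fixed-clock jitter buffer.

    Packet n is due for playout at tick n + depth. Returns a list of
    (seq, "play" | "conceal") in playout order.
    """
    pending = sorted(arrivals)      # ordered by arrival tick
    buffer = set()                  # sequence numbers currently held
    log = []
    next_seq = first_seq
    tick = 0
    while next_seq <= last_seq:
        # Insert everything that has arrived by this tick.
        while pending and pending[0][0] <= tick:
            _, seq = pending.pop(0)
            if seq >= next_seq:     # too-late packets were already concealed
                buffer.add(seq)
        # Playout deadline for the next expected packet.
        if tick >= next_seq + depth:
            if next_seq in buffer:
                buffer.remove(next_seq)
                log.append((next_seq, "play"))
            else:
                log.append((next_seq, "conceal"))  # never block the call
            next_seq += 1
        tick += 1
    return log


# Arrival order 1 2 3 5 4 7; packet 6 never shows up. Ticks are invented.
ARRIVALS = [(1, 1), (2, 2), (3, 3), (5, 5), (6, 4), (7, 7)]

deep = simulate(ARRIVALS, first_seq=1, last_seq=7, depth=3)
shallow = simulate(ARRIVALS, first_seq=1, last_seq=7, depth=1)
```

With depth=3 only packet 6 is concealed, matching the walkthrough. With depth=1 packet 4 also misses its deadline and is concealed, which is the shallow-buffer trade-off: lower latency, more glitches.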
The receiver is still missing packet 6 and its playout deadline just passed. What should the jitter buffer do?
Why is TCP a poor transport for a live voice call?