Before any audio flows, a chain of SIP messages negotiates the call path between caller, proxies, and callee.
When you place a voice or video call over the internet, the audio doesn't just appear. First, the two ends have to find each other and agree to talk. SIP (Session Initiation Protocol) is the signalling layer that handles this introduction.
The caller's phone (the UAC, user agent client) sends an INVITE. It travels through one or more SIP proxies, which use the location recorded by the registrar to find where the callee (the UAS, user agent server) is currently registered, then forward it. The callee rings (180 Ringing), answers (200 OK), and the caller confirms (ACK). Only then does the media stream begin, and the media (RTP) usually flows directly between the two endpoints, bypassing the proxies entirely. SIP routes the signalling, not the media.
SIP is a text protocol that looks a lot like HTTP. A request has a request line (method, target URI, protocol version), a stack of headers, and an optional body. The Via headers record the path the request took so the response can retrace it; From and To identify the parties. Here is a trimmed exchange for Alice calling Bob:
INVITE sip:bob@example.com SIP/2.0
Via: SIP/2.0/UDP pc.alice.example.com;branch=z9hG4bK77
Max-Forwards: 70
From: Alice <sip:alice@example.com>;tag=1928
To: Bob <sip:bob@example.com>
Call-ID: a84b4c76e66710
CSeq: 1 INVITE
Contact: <sip:alice@pc.alice.example.com>
Content-Type: application/sdp
... SDP body offers codecs & the caller's media address ...
SIP/2.0 100 Trying (proxy: I'm working on it)
SIP/2.0 180 Ringing (Bob's phone is ringing)
SIP/2.0 200 OK (Bob picked up; SDP answer attached)
ACK sip:bob@pc.bob.example.com SIP/2.0
... three-way handshake done; media (RTP) now flows peer-to-peer ...
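To make the request layout concrete, here is a minimal Python sketch that splits a SIP message into its start line, headers, and body. It is an illustration only: the function names `parse_sip_message` and `header_values` are made up for this example, the one-line body `v=0` is a placeholder rather than Alice's real SDP offer, and the parser ignores folded header lines and compact header names that a real SIP stack must handle.

```python
def parse_sip_message(raw):
    """Split a SIP message into (start_line, headers, body)."""
    # Headers and body are separated by the first blank line.
    head, _, body = raw.replace("\r\n", "\n").partition("\n\n")
    lines = head.split("\n")
    start_line = lines[0]
    headers = []  # (name, value) pairs; a name such as Via may repeat
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return start_line, headers, body


def header_values(headers, name):
    """Return every value of a header; SIP header names are case-insensitive."""
    return [v for n, v in headers if n.lower() == name.lower()]


# The INVITE from the exchange above, with a placeholder body.
invite = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP pc.alice.example.com;branch=z9hG4bK77\r\n"
    "Max-Forwards: 70\r\n"
    "From: Alice <sip:alice@example.com>;tag=1928\r\n"
    "To: Bob <sip:bob@example.com>\r\n"
    "Call-ID: a84b4c76e66710\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Type: application/sdp\r\n"
    "\r\n"
    "v=0\r\n"  # placeholder, not the actual SDP offer
)

start_line, headers, body = parse_sip_message(invite)
method, request_uri, version = start_line.split(" ")
print(method, request_uri, version)
print(header_values(headers, "via"))
```

Keeping headers as a list of pairs rather than a dictionary preserves repeated headers and their order, which matters for Via.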
The INVITE / 200 OK / ACK trio is the three-way handshake that confirms both sides are ready. Record-Route lets a proxy insert itself into the path so that later in-dialog messages (like BYE) still flow through it.
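The caller's side of that handshake can be sketched as a tiny state tracker. This is a simplified illustration, not a full SIP transaction layer: the function name `on_invite_response` and the state names are invented for this example, and timers, retransmission limits, and CANCEL are left out.

```python
def on_invite_response(state, status):
    """Return (new_state, send_ack) for a response to the caller's INVITE.

    States used here (hypothetical names): "calling", "proceeding",
    "established", "failed".
    """
    if not 100 <= status <= 699:
        raise ValueError(f"not a SIP status code: {status}")
    if status < 200:
        # Provisional (1xx): the call is progressing but not answered. No ACK.
        new_state = "proceeding" if state == "calling" else state
        return new_state, False
    if status < 300:
        # 2xx: the callee answered. The caller must send ACK, and must
        # send it again if the 200 OK is retransmitted.
        return "established", True
    # 3xx-6xx: final non-success responses to INVITE are also acknowledged.
    return "failed", True


state = "calling"
for status in (100, 180, 200):
    state, send_ack = on_invite_response(state, status)
    print(status, state, "send ACK" if send_ack else "no ACK")
```

The loop replays the responses from the exchange above: only the 200 OK moves the call to the established state and triggers an ACK.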
| Aspect | Cost | Signal to watch |
|---|---|---|
| Signalling vs media separation | Two planes to operate; media goes peer-to-peer while SIP stays on the proxies. | Signalling succeeds but the call is silent — the media path failed independently. |
| NAT traversal | SIP and RTP both struggle through NAT; you need STUN, TURN, or ICE. | One-way audio, or audio only on the same LAN. |
| Statefulness | Stateful proxies track transactions (memory); stateless ones are cheaper but blind. | Lost retransmits or duplicate dialogs when state assumptions break. |
| Latency | Extra round trips through proxies before the callee even rings. | Slow post-dial delay; users hear silence before ringback. |
| Interop | Header quirks and optional features vary across vendors. | Calls work to some destinations but fail to others on the same setup. |
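
One quick check for the NAT row is to look at the address an endpoint advertises in its SDP. The sketch below is a heuristic under stated assumptions, not a diagnosis: the function names are invented for this example, the sample offer is hypothetical, and a private address in the `c=` line only suggests that the far end may be unable to send RTP back without STUN, TURN, or ICE.

```python
import ipaddress

# RFC 1918 private IPv4 ranges.
PRIVATE_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def sdp_connection_addresses(sdp):
    """Yield the addresses from the c= lines of an SDP body."""
    for line in sdp.splitlines():
        if line.startswith("c="):
            # Format: c=<nettype> <addrtype> <connection-address>
            parts = line[2:].split()
            if len(parts) == 3:
                yield parts[2].split("/")[0]  # drop an optional /ttl suffix


def advertises_private_address(sdp):
    """True if any c= line carries an RFC 1918 address."""
    for addr in sdp_connection_addresses(sdp):
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            continue  # a hostname; this sketch cannot judge it
        if any(ip in net for net in PRIVATE_NETS):
            return True
    return False


# Hypothetical SDP offer from a phone behind a home router.
private_offer = "\n".join([
    "v=0",
    "o=alice 2890844526 2890844526 IN IP4 192.168.1.20",
    "s=-",
    "c=IN IP4 192.168.1.20",
    "t=0 0",
    "m=audio 49170 RTP/AVP 0",
])

print(advertises_private_address(private_offer))
```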
Forgetting the ACK. Without it the callee keeps retransmitting 200 OK and the call setup hangs instead of completing.
Omitting Record-Route when the proxy needs to stay in the path, so later in-dialog requests (like BYE) skip the proxy.
Treating provisional (1xx) responses like 100 Trying and 180 Ringing as success. They're not final, and the session isn't established yet.
Alice calls Bob through one proxy. Alice's phone sends INVITE to the proxy. The proxy looks up Bob's current registration (he registered earlier from his desk phone) and forwards the INVITE to him, returning 100 Trying to Alice so she knows it's in flight. Bob's phone returns 180 Ringing, which is when Alice hears ringback. Bob answers: 200 OK travels back Bob → proxy → Alice. Alice confirms with ACK down the same chain, completing the three-way handshake. Now the talking begins: media (RTP) flows directly between Alice and Bob, not through the proxy. When Bob hangs up, his phone sends BYE (acknowledged with 200 OK) and the session tears down. The ladder above walks exactly these messages, step by step.
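The return trip of the 200 OK (Bob → proxy → Alice) relies on the Via headers. The Python sketch below models that in a deliberately simplified way: each Via is reduced to a host name, the proxy host `proxy.example.com` and both function names are assumptions made for this example, and real Via headers also carry the transport and a branch parameter.

```python
def via_stack_at_callee(caller_host, proxy_hosts):
    """Return the Via stack (topmost first) as the request reaches the callee."""
    vias = [caller_host]        # the caller adds the first Via
    for proxy in proxy_hosts:
        vias.insert(0, proxy)   # each proxy pushes its own Via on top
    return vias


def response_hops(vias):
    """Return the hops a response visits on its way back to the caller."""
    remaining = list(vias)      # the callee copies the Via stack into its response
    hops = []
    while remaining:
        # The response goes to the topmost Via; that hop removes its own
        # entry before passing the response on.
        hops.append(remaining.pop(0))
    return hops


vias = via_stack_at_callee("pc.alice.example.com", ["proxy.example.com"])
print(" -> ".join(["bob"] + response_hops(vias)))
```

Because every hop only looks at the topmost entry, no proxy needs to know the full path; the stack itself carries the way back.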
Does the actual voice audio flow through the SIP proxy?
Which message completes the three-way handshake that establishes the call?