Turn a record into compact bytes with numbered fields, so old and new code can still read each other's data.
A record like {id: 7, name: "Ada", active: true} lives in memory as objects and pointers. To store it on disk or send it over a wire, you have to flatten it into a sequence of bytes — that's encoding (a.k.a. serialization).
Text formats like JSON are self-describing: they spell out every field name in every record. Schema-based binary formats like Protocol Buffers and Thrift instead refer to each field by a small number (Avro goes a step further and stores no per-field identifier at all). That's smaller and faster, and, crucially, it lets a reader on an old schema and a writer on a new schema keep understanding each other, as long as you only ever add optional fields under new tag numbers.
In a protobuf-style wire format, each field is a tag followed by a value. The tag is a single varint that packs the field number and a 3-bit wire type:
    tag = (field_number << 3) | wire_type

    wire types:  0 = varint        (int32, int64, bool, enum)
                 1 = 64-bit fixed
                 2 = length-delim  (string, bytes, sub-message)
                 5 = 32-bit fixed

    # id = 7        -> field 1, wire 0: tag (1<<3)|0 = 0x08, value 0x07
    # name = "Ada"  -> field 2, wire 2: tag (2<<3)|2 = 0x12, len 0x03, "Ada" = 41 64 61
    # active = true -> field 3, wire 0: tag (3<<3)|0 = 0x18, value 0x01
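The byte layout above can be reproduced with a few lines of Python. This is a minimal sketch, not the protobuf library: the helper names (`encode_varint`, `encode_tag`, `encode_record`) are made up for illustration, and it only handles non-negative integers, booleans, and UTF-8 strings.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, low 7 bits first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(byte)         # last byte: continuation bit stays clear
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """Pack field number and 3-bit wire type into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Wire type 0: tag, then the value as a varint."""
    return encode_tag(field_number, 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Wire type 2: tag, then byte length, then the UTF-8 payload."""
    payload = text.encode("utf-8")
    return encode_tag(field_number, 2) + encode_varint(len(payload)) + payload


def encode_record(id_: int, name: str, active: bool) -> bytes:
    """Encode the example record with field numbers 1, 2, 3."""
    return (
        encode_varint_field(1, id_)
        + encode_string_field(2, name)
        + encode_varint_field(3, int(active))
    )


encoded = encode_record(7, "Ada", True)
print(encoded.hex())  # 080712034164611801
```

The nine output bytes are exactly the three tag/value groups listed above, in field order.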
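The reason this survives schema changes is the wire type. When a reader hits a field number it doesn't recognize, the wire type tells it how to find where the value ends (read one varint, read a length and then that many bytes, or step over a fixed 4 or 8 bytes), so it can skip cleanly and move on: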
    while not end_of_buffer:
        tag = read_varint()
        field_no = tag >> 3
        wire_type = tag & 0x07
        if field_no in schema:
            parse_value(wire_type)   # known: decode it
        else:
            skip(wire_type)          # unknown: varint    -> read 1 varint
                                     #          len-delim -> read len, skip len bytes
                                     #          fixed     -> skip 8 (wire 1) or 4 (wire 5) bytes
    # any field present in schema but absent in bytes -> use its default
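A runnable version of that loop might look like the following. It is a sketch under stated assumptions: the schema is a plain dict mapping field number to field name, values are returned raw (an int for varints, bytes for everything else), and `read_varint` and `decode` are illustrative names rather than any library's API.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # continuation bit clear: this was the last byte
            return result, pos
        shift += 7


def decode(buf: bytes, schema: dict[int, str]) -> dict[str, object]:
    """Decode fields listed in `schema`; skip the rest using only the wire type."""
    record: dict[str, object] = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_no, wire_type = tag >> 3, tag & 0x07

        # The wire type alone says where the value ends, known field or not.
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        elif wire_type == 1:  # 64-bit fixed
            value = buf[pos:pos + 8]
            pos += 8
        elif wire_type == 5:  # 32-bit fixed
            value = buf[pos:pos + 4]
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")

        if field_no in schema:  # known field: keep it
            record[schema[field_no]] = value
        # unknown field: the bytes were consumed above and are simply dropped
    return record


data = bytes.fromhex("08 07 12 03 41 64 61 18 01")
print(decode(data, {1: "id", 2: "name"}))  # field 3 is unknown here and skipped
```

Reading the value before checking the schema keeps the skip logic and the parse logic in one place; a real decoder would usually avoid materializing values it is about to discard.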
Protobuf / Thrift carry these tags inline with the data. Avro takes a different route: it stores no per-field tags at all — the bytes are positional — and instead pairs the data with the writer's schema, resolving it against the reader's schema field-by-name at read time. Both achieve evolution; they just put the "what field is this" knowledge in different places.
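The idea behind Avro's approach can be illustrated with a toy resolver. This is deliberately not Avro's binary format or API: it works on already-decoded positional values, represents the writer's schema as an ordered list of field names and the reader's schema as a name-to-default mapping, and `resolve` is a hypothetical helper.

```python
def resolve(values: list, writer_fields: list[str],
            reader_fields: dict[str, object]) -> dict[str, object]:
    """Match positional values to names via the writer's schema, then
    project onto the reader's schema by field name."""
    # Positions only mean something relative to the writer's field order.
    written = dict(zip(writer_fields, values))
    # Fields the reader wants but the writer lacked get the reader's default;
    # fields the writer had but the reader doesn't list are dropped.
    return {name: written.get(name, default)
            for name, default in reader_fields.items()}


# Old writer (no email), new reader: email falls back to its default.
print(resolve([7, "Ada"], ["id", "name"], {"id": 0, "name": "", "email": ""}))

# New writer, old reader: the extra email value is ignored.
print(resolve([7, "Ada", "a@x.io"], ["id", "name", "email"], {"id": 0, "name": ""}))
```

The data itself carries no field identity here; everything hinges on having the writer's schema available at read time.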
| Dimension | JSON (text) | Tagged binary (protobuf) |
|---|---|---|
| Size | Larger — every field name repeated in every record | Smaller — field numbers + varint values; {id:7…} ≈ 35 B vs 9 B |
| Parse speed | Slower — tokenize text, match string keys | Faster — read tag, branch on a number; no string matching |
| Schema coupling | Self-describing — readable without a schema | Needs the schema / IDL to interpret the numbers |
| Evolution safety | Loose — keys are free-form; precision/typing pitfalls | Safe to add optional fields; renumbering or reusing a tag breaks compat |
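The size figures in the table can be checked directly. The sketch below assumes compact JSON (no whitespace between tokens) and takes the tagged-binary bytes for the three-field record as given rather than encoding them.

```python
import json

record = {"id": 7, "name": "Ada", "active": True}

# Compact separators strip the spaces json.dumps would otherwise insert.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# id (08 07), name (12 03 41 64 61), active (18 01)
tagged_bytes = bytes.fromhex("08 07 12 03 41 64 61 18 01")

print(len(json_bytes), len(tagged_bytes))  # 35 9
```

Most of the JSON overhead is the quoted field names and punctuation, which the binary form replaces with one tag byte per field.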
- When you remove a field, reserve its tag number forever; never hand it to a different field.
- Don't change a field's type (e.g. int32 → string). The wire type or value layout no longer matches, so decoding old bytes can produce garbage or a silently dropped value rather than a clear error.
- int64 values above 2^53 lose precision in JSON parsers that store numbers as doubles (JavaScript's, for example), so encode them as strings. And on any format, decide what an unknown enum value or oneof case defaults to, or readers diverge.

Take {id: 7, name: "Ada"}. As JSON it's the string {"id":7,"name":"Ada"}: 21 bytes, and the field names id and name are along for the ride. As tagged binary it's just 08 07 12 03 41 64 61, which is 7 bytes: tag+value for id, then tag+length+UTF-8 for name.
Now the schema grows a fourth field, email (field 4, optional string). Two directions, both fine:
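- New writer, old reader: the new writer appends 22 06 61 40 78 2e 69 6f for email (tag 0x22 = field 4 with wire type 2, length 6, then "a@x.io"). The old reader doesn't know field 4, sees wire type 2, reads the length (6), and skips 6 bytes; id and name decode normally.
- Old writer, new reader: the bytes contain no field 4. The new reader concludes that email was never present, and fills in its default "".

Both directions work because the tag numbers 1, 2, 3 never moved. Stable tags are the contract.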
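Both directions can be exercised with a small decoder. This is a sketch with assumptions spelled out: each schema maps field number to a (name, default) pair, every length-delimited field is treated as a UTF-8 string, only wire types 0 and 2 are handled, and the names `OLD_SCHEMA`, `NEW_SCHEMA`, and `decode` are illustrative.

```python
OLD_SCHEMA = {1: ("id", 0), 2: ("name", ""), 3: ("active", False)}
NEW_SCHEMA = {**OLD_SCHEMA, 4: ("email", "")}  # same tags 1-3, plus a new tag 4


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def decode(buf: bytes, schema: dict) -> dict:
    # Start from defaults so fields absent from the bytes are still populated.
    record = {name: default for name, default in schema.values()}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_no, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_no in schema:
            name, default = schema[field_no]
            record[name] = type(default)(value)  # coerce, e.g. varint 1 -> True
        # unknown field numbers fall through: already consumed, now ignored
    return record


OLD_BYTES = bytes.fromhex("08 07 12 03 41 64 61")                # id, name
NEW_BYTES = OLD_BYTES + bytes.fromhex("22 06 61 40 78 2e 69 6f")  # + email

print(decode(NEW_BYTES, OLD_SCHEMA))  # old reader skips field 4
print(decode(OLD_BYTES, NEW_SCHEMA))  # new reader defaults email to ""
```

Neither reader needs to know which version wrote the bytes; the shared, unchanged field numbers carry all the coordination.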
You remove an old field from your protobuf schema, and later you want to add a different, unrelated field. To stay safe, what must you do with the removed field's tag number?
Why does a newly added field need to be optional rather than required?