Encoding and schema evolution

Turn a record into compact bytes with numbered fields, so old and new code can still read each other's data.

The idea

A record like {id: 7, name: "Ada", active: true} lives in memory as objects and pointers. To store it on disk or send it over a wire, you have to flatten it into a sequence of bytes — that's encoding (a.k.a. serialization).

Text formats like JSON are self-describing: they spell out every field name in every record. Schema-based binary formats like Protocol Buffers, Thrift, and Avro instead refer to each field by a small number. That's smaller and faster — and, crucially, it lets a reader on an old schema and a writer on a new schema keep understanding each other, as long as you only ever add optional fields under new tag numbers.

How it works

In a protobuf-style wire format, each field is a tag followed by a value. The tag is a single varint that packs the field number and a 3-bit wire type:

tag = (field_number << 3) | wire_type

wire types:  0 = varint        (int32, int64, bool, enum)
             2 = length-delim   (string, bytes, sub-message)
            (1 = 64-bit, 5 = 32-bit fixed)

# id = 7  -> field 1, wire 0:  tag (1<<3)|0 = 0x08 , value 0x07
# name="Ada" -> field 2, wire 2: tag (2<<3)|2 = 0x12 , len 0x03, "Ada"=41 64 61
# active=true -> field 3, wire 0: tag (3<<3)|0 = 0x18 , value 0x01
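
The byte layout above can be reproduced with a few lines of Python. This is a minimal sketch, not the protobuf library's API: the helper names (encode_varint, make_tag, encode_varint_field, encode_string_field) are made up for illustration, and only wire types 0 and 2 are covered.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (7 bits per byte, low group first)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def make_tag(field_number: int, wire_type: int) -> bytes:
    """tag = (field_number << 3) | wire_type, itself written as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_varint_field(field_number: int, value: int) -> bytes:
    # Wire type 0: tag, then the value as a varint (bools become 0 or 1).
    return make_tag(field_number, 0) + encode_varint(int(value))


def encode_string_field(field_number: int, text: str) -> bytes:
    # Wire type 2: tag, then the byte length as a varint, then the UTF-8 bytes.
    data = text.encode("utf-8")
    return make_tag(field_number, 2) + encode_varint(len(data)) + data


record = (
    encode_varint_field(1, 7)          # id = 7
    + encode_string_field(2, "Ada")    # name = "Ada"
    + encode_varint_field(3, True)     # active = true
)
print(record.hex(" "))  # 08 07 12 03 41 64 61 18 01
```

Field numbers 1 to 15 keep the tag in a single byte, which is why the three tags here are 0x08, 0x12, and 0x18.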

The reason this survives schema changes is the wire type. When a reader hits a tag it doesn't recognize, the wire type tells it how to find the end of the value (read one varint, skip a fixed 4 or 8 bytes, or read a length and skip that many bytes), so it can skip cleanly and move on:

while not end_of_buffer:
    tag       = read_varint()
    field_no  = tag >> 3
    wire_type = tag & 0x07
    if field_no in schema:
        parse_value(wire_type)          # known: decode it
    else:
        skip(wire_type)                 # unknown: varint -> read 1 varint
                                        #          len-delim -> read len, skip len bytes
# any field present in schema but absent in bytes -> use its default
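
A runnable version of that loop might look like the following. It is a sketch under stated assumptions: the SCHEMA dictionary (field number mapped to name, kind, and default) is a hypothetical stand-in for a real generated schema, the helper names are invented, and the decoder trusts that a known field arrives with the wire type its kind implies.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, new_pos) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:       # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def skip_value(buf: bytes, pos: int, wire_type: int) -> int:
    """Advance past a value we do not understand, using only its wire type."""
    if wire_type == 0:            # varint
        _, pos = read_varint(buf, pos)
        return pos
    if wire_type == 1:            # 64-bit fixed
        return pos + 8
    if wire_type == 2:            # length-delimited
        length, pos = read_varint(buf, pos)
        return pos + length
    if wire_type == 5:            # 32-bit fixed
        return pos + 4
    raise ValueError(f"unsupported wire type {wire_type}")


# Hypothetical schema: field number -> (name, kind, default)
SCHEMA = {
    1: ("id", "int", 0),
    2: ("name", "string", ""),
    3: ("active", "bool", False),
}


def decode(buf: bytes, schema: dict) -> dict:
    # Start from defaults so fields absent from the bytes still have a value.
    record = {name: default for name, _, default in schema.values()}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_no, wire_type = tag >> 3, tag & 0x07
        if field_no not in schema:
            pos = skip_value(buf, pos, wire_type)   # unknown: skip it
            continue
        name, kind, _ = schema[field_no]
        if kind == "string":
            length, pos = read_varint(buf, pos)
            record[name] = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            value, pos = read_varint(buf, pos)
            record[name] = bool(value) if kind == "bool" else value
    return record


data = bytes.fromhex("08 07 12 03 41 64 61 18 01")
print(decode(data, SCHEMA))  # {'id': 7, 'name': 'Ada', 'active': True}
```

Note that skip_value never consults the schema: the three wire-type bits alone are enough to step over the value.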

Protobuf and Thrift carry these tags inline with the data. Avro takes a different route: it stores no per-field tags at all, so the bytes are purely positional. Instead, the data is paired with the writer's schema, and at read time the reader resolves that schema against its own by matching field names. Both achieve evolution; they just put the "what field is this" knowledge in different places.
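
To make the contrast concrete, here is a toy illustration of the positional approach. It is not Avro's real binary encoding (Avro uses zigzag varints and its own schema language); the one-byte lengths, the schema tuples of (name, kind, default), and the function names are all simplifying assumptions. What it does show is the resolution idea: decode by position using the writer's schema, then match fields by name against the reader's schema.

```python
def write_positional(record: dict, writer_schema: list) -> bytes:
    """Write values in schema order with no tags (toy format)."""
    out = bytearray()
    for name, kind, _ in writer_schema:
        value = record[name]
        if kind == "string":
            data = value.encode("utf-8")
            out.append(len(data))   # toy: one-byte length, strings under 256 bytes
            out += data
        else:
            out.append(value)       # toy: ints must fit in one byte
    return bytes(out)


def read_positional(buf: bytes, writer_schema: list, reader_schema: list) -> dict:
    # Step 1: the bytes are meaningless without the writer's schema,
    # so decode them strictly in the writer's field order.
    written = {}
    pos = 0
    for name, kind, _ in writer_schema:
        if kind == "string":
            length = buf[pos]
            written[name] = buf[pos + 1:pos + 1 + length].decode("utf-8")
            pos += 1 + length
        else:
            written[name] = buf[pos]
            pos += 1
    # Step 2: resolve by name. Fields the reader lacks are dropped;
    # fields the writer lacked take the reader's default.
    return {name: written.get(name, default) for name, _, default in reader_schema}


WRITER_SCHEMA = [("id", "int", 0), ("name", "string", "")]
READER_SCHEMA = [("name", "string", ""), ("id", "int", 0), ("email", "string", "")]

buf = write_positional({"id": 7, "name": "Ada"}, WRITER_SCHEMA)
print(buf.hex(" "))                                        # 07 03 41 64 61
print(read_positional(buf, WRITER_SCHEMA, READER_SCHEMA))  # {'name': 'Ada', 'id': 7, 'email': ''}
```

The payload is two bytes shorter than the tagged version because the tags are gone, but it cannot be read at all without the writer's schema.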

Cost

Dimension | JSON (text) | Tagged binary (protobuf)
Size | Larger: every field name repeated in every record | Smaller: field numbers + varint values; {id:7…} ≈ 35 B vs 9 B
Parse speed | Slower: tokenize text, match string keys | Faster: read tag, branch on a number; no string matching
Schema coupling | Self-describing: readable without a schema | Needs the schema / IDL to interpret the numbers
Evolution safety | Loose: keys are free-form; precision/typing pitfalls | Safe to add optional fields; renumbering or reusing a tag breaks compat
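
The size figures for the three-field record can be checked directly. This small sketch assumes compact JSON (no whitespace between tokens) and reuses the tagged bytes derived earlier:

```python
import json

record = {"id": 7, "name": "Ada", "active": True}

# Compact JSON: no spaces after "," or ":".
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Tagged binary for the same record: id, name, active.
as_binary = bytes.fromhex("08 07 12 03 41 64 61 18 01")

print(as_json.decode())              # {"id":7,"name":"Ada","active":true}
print(len(as_json), len(as_binary))  # 35 9
```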

Watch out for

Worked example

Take {id: 7, name: "Ada"}. As JSON it's the string {"id":7,"name":"Ada"}: 21 bytes, with the field names id and name along for the ride. As tagged binary it's just 08 07 12 03 41 64 61, which is 7 bytes: tag+value for id, then tag+length+UTF-8 for name.

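Now the schema grows a fourth field, email (field 4, optional string). Two directions, both fine:

New writer, old reader: the bytes now carry tag 0x22 (field 4, wire type 2). The old reader doesn't recognize field 4, so it reads the length and skips that many bytes.

Old writer, new reader: the bytes contain no field 4, so the new reader falls back to email's default.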

Both directions work because the tag numbers 1, 2, 3 never moved. Stable tags are the contract.
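
Both directions can be exercised with a compact decoder. This is a sketch, not a real protobuf implementation: it handles only wire types 0 and 2, the schema dictionaries (field number mapped to name and default) are hypothetical, bools are left as raw 0/1 integers when present, and the email address is a made-up example value.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, new_pos) for the varint starting at buf[pos]."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode(buf: bytes, schema: dict) -> dict:
    """schema: field number -> (name, default). Wire types 0 and 2 only."""
    record = {name: default for name, default in schema.values()}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_no, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]   # raw bytes; decoded only if known
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_no in schema:              # unknown fields are simply dropped
            name = schema[field_no][0]
            record[name] = value.decode("utf-8") if isinstance(value, bytes) else value
    return record


OLD_SCHEMA = {1: ("id", 0), 2: ("name", ""), 3: ("active", False)}
NEW_SCHEMA = {**OLD_SCHEMA, 4: ("email", "")}

old_bytes = bytes.fromhex("08 07 12 03 41 64 61")            # written with the old schema
email = b"ada@example.com"                                   # hypothetical value
new_bytes = old_bytes + bytes([0x22, len(email)]) + email    # field 4, wire type 2

# New writer, old reader: field 4 is skipped.
print(decode(new_bytes, OLD_SCHEMA))  # {'id': 7, 'name': 'Ada', 'active': False}

# Old writer, new reader: email falls back to its default.
print(decode(old_bytes, NEW_SCHEMA))  # {'id': 7, 'name': 'Ada', 'active': False, 'email': ''}
```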

Check yourself

You remove an old field from your protobuf schema, and later you want to add a different, unrelated field. To stay safe, what must you do with the removed field's tag number?

Why does a newly added field need to be optional rather than required?