Encoding and schema evolution

Turn a record into compact bytes with numbered fields, so old and new code can still read each other's data.

The idea

A record like {id: 7, name: "Ada", active: true} lives in memory as objects and pointers. To store it on disk or send it over a wire, you have to flatten it into a sequence of bytes — that's encoding (a.k.a. serialization).

Text formats like JSON are self-describing: they spell out every field name in every record. Schema-based binary formats like Protocol Buffers and Thrift instead refer to each field by a small number (Avro, covered below, drops even that and relies on the schema alone). That's smaller and faster, and, crucially, it lets readers and writers on different schema versions keep understanding each other, as long as you only ever add optional fields under new tag numbers.

How it works

In a protobuf-style wire format, each field is a tag followed by a value. The tag is a single varint that packs the field number and a 3-bit wire type:

tag = (field_number << 3) | wire_type

wire types:  0 = varint        (int32, int64, bool, enum)
             2 = length-delim   (string, bytes, sub-message)
            (1 = 64-bit, 5 = 32-bit fixed)

# id = 7  -> field 1, wire 0:  tag (1<<3)|0 = 0x08 , value 0x07
# name="Ada" -> field 2, wire 2: tag (2<<3)|2 = 0x12 , len 0x03, "Ada"=41 64 61
# active=true -> field 3, wire 0: tag (3<<3)|0 = 0x18 , value 0x01
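
The three annotated fields above can be reproduced with a few lines of Python. This is a minimal sketch of the wire layout described here, not the protobuf library's API: the helper names (encode_varint, encode_tag, encode_varint_field, encode_string_field) are made up for illustration, and only non-negative integers and UTF-8 strings are handled.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, low 7 bits first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """tag = (field_number << 3) | wire_type, itself written as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_varint_field(field_number: int, value: int) -> bytes:
    return encode_tag(field_number, 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    return encode_tag(field_number, 2) + encode_varint(len(payload)) + payload


record = (
    encode_varint_field(1, 7)         # id = 7
    + encode_string_field(2, "Ada")   # name = "Ada"
    + encode_varint_field(3, 1)       # active = true, stored as the varint 1
)
print(record.hex(" "))  # 08 07 12 03 41 64 61 18 01
```

Values of 128 or more spill into a second varint byte, which is why the length and the tag are varints too rather than fixed single bytes.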

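The reason this survives schema changes is the wire type. When a reader hits a tag it doesn't recognize, the wire type tells it where the value ends: a varint runs until the first byte whose high bit is clear, a length-delimited value starts with its own byte count, and the fixed types are always 8 or 4 bytes. So the reader can skip cleanly and move on: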

while not end_of_buffer:
    tag       = read_varint()
    field_no  = tag >> 3
    wire_type = tag & 0x07
    if field_no in schema:
        parse_value(wire_type)          # known: decode it
    else:
        skip(wire_type)                 # unknown: varint    -> read 1 varint, discard it
                                        #          len-delim -> read len, skip len bytes
                                        #          fixed     -> skip 8 bytes (wire 1) or 4 (wire 5)
# any field present in schema but absent in bytes -> use its default
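
To make the skip step concrete, here is a runnable sketch that walks a buffer with no schema at all, using only the wire type to step over each value. The function names (read_varint, skip_value, field_numbers) are illustrative assumptions, and the sketch does no bounds checking on truncated input.

```python
from __future__ import annotations


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear: this was the last byte
            return result, pos
        shift += 7


def skip_value(buf: bytes, pos: int, wire_type: int) -> int:
    """Return the position just past a value of the given wire type."""
    if wire_type == 0:               # varint: read it and throw it away
        _, pos = read_varint(buf, pos)
        return pos
    if wire_type == 1:               # 64-bit fixed
        return pos + 8
    if wire_type == 2:               # length-delimited: length, then payload
        length, pos = read_varint(buf, pos)
        return pos + length
    if wire_type == 5:               # 32-bit fixed
        return pos + 4
    raise ValueError(f"unsupported wire type {wire_type}")


def field_numbers(buf: bytes) -> list[int]:
    """List the field numbers in a buffer without knowing any schema."""
    seen = []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        seen.append(tag >> 3)
        pos = skip_value(buf, pos, tag & 0x07)
    return seen


data = bytes.fromhex("08 07 12 03 41 64 61 18 01")
print(field_numbers(data))  # [1, 2, 3]
```

The walker never looks inside a value, which is exactly what an old reader does with a field it has never heard of.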

Protobuf / Thrift carry these tags inline with the data. Avro takes a different route: it stores no per-field tags at all — the bytes are positional — and instead pairs the data with the writer's schema, resolving it against the reader's schema field-by-name at read time. Both achieve evolution; they just put the "what field is this" knowledge in different places.
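
The positional approach can be illustrated with a toy format. This is not Avro's actual binary encoding or API: every value here is a string with a one-byte length prefix, the function names and the city field are invented for the example, and real schema resolution has more rules than a lookup by name. It only shows where the "what field is this" knowledge lives.

```python
from __future__ import annotations


def write_positional(writer_fields: list[str], record: dict[str, str]) -> bytes:
    """Concatenate values in writer-schema order; no names or tags in the bytes."""
    out = bytearray()
    for name in writer_fields:
        payload = record[name].encode("utf-8")
        if len(payload) > 255:
            raise ValueError("toy format: values are limited to 255 bytes")
        out.append(len(payload))
        out += payload
    return bytes(out)


def read_resolved(
    data: bytes,
    writer_fields: list[str],
    reader_defaults: dict[str, str],
) -> dict[str, str]:
    # Step 1: decode positionally, using the WRITER's field order.
    decoded = {}
    pos = 0
    for name in writer_fields:
        length = data[pos]
        decoded[name] = data[pos + 1 : pos + 1 + length].decode("utf-8")
        pos += 1 + length
    # Step 2: resolve by name against the READER's schema.
    # Fields the reader lacks are dropped; fields the writer lacked get defaults.
    return {
        name: decoded.get(name, default)
        for name, default in reader_defaults.items()
    }


old_writer = ["name", "city"]            # hypothetical writer schema
new_reader = {"name": "", "email": ""}   # reader dropped city, added email

data = write_positional(old_writer, {"name": "Ada", "city": "London"})
print(read_resolved(data, old_writer, new_reader))  # {'name': 'Ada', 'email': ''}
```

Without the writer's field list the bytes are unreadable, which is the trade this style makes in exchange for carrying no per-field overhead.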

Cost

Dimension	JSON (text)	Tagged binary (protobuf)
Size	Larger — every field name repeated in every record	Smaller — field numbers + varint values; `{id:7…}` ≈ 35 B vs 9 B
Parse speed	Slower — tokenize text, match string keys	Faster — read tag, branch on a number; no string matching
Schema coupling	Self-describing — readable without a schema	Needs the schema / IDL to interpret the numbers
Evolution safety	Loose — keys are free-form; precision/typing pitfalls	Safe to add optional fields; renumbering or reusing a tag breaks compat
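
The size figures in the table can be checked directly. The sketch below assumes compact JSON (no whitespace) and takes the tagged-binary bytes for the full three-field record as a hand-written hex string rather than producing them with a real protobuf library.

```python
import json

record = {"id": 7, "name": "Ada", "active": True}

# Compact JSON: no spaces after ',' or ':'.
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Tagged binary for the same record: id, name, active.
as_binary = bytes.fromhex("08 07 12 03 41 64 61 18 01")

print(as_json.decode())              # {"id":7,"name":"Ada","active":true}
print(len(as_json), len(as_binary))  # 35 9
```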

Watch out for

Reusing or renumbering an existing tag. Old stored bytes still carry the old field under that number, so new code silently misreads them. When you delete a field, reserve its tag number forever — never hand it to a different field.
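Changing a field's type incompatibly (e.g. int32 → string, or int32 → sint32). If the wire type changes, the old bytes no longer match what the reader expects, and depending on the implementation the value is dropped as unknown or decoding fails. If the wire type stays the same but the interpretation changes (sint32 is ZigZag-encoded, int32 is not), old bytes decode to a wrong number with no error at all.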
Making a new field required. Old writers can't emit it, so new readers reject perfectly valid old records. Add new fields as optional with a sensible default.
Relying on field order instead of tags. In tagged formats the field number is the identity; never assume fields arrive in declaration order or that order means anything.
JSON number precision and enums. int64 values above 2^53 lose precision in any parser that stores JSON numbers as doubles (JavaScript's does), so encode them as strings. And on any format, decide what an unknown enum value or oneof case defaults to, or readers diverge.
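
The precision pitfall is easy to reproduce. Python's own json module parses integers exactly, so this sketch uses float() to stand in for a reader that stores JSON numbers as doubles; the field name id is just the one from the running example.

```python
import json

big = 2**53 + 1                  # 9007199254740993, well within int64 range

# A double has 53 bits of mantissa, so this value cannot be represented.
as_double = float(big)
print(int(as_double))            # 9007199254740992 (off by one)

# Sending the value as a string keeps every digit.
wire = json.dumps({"id": str(big)})
restored = int(json.loads(wire)["id"])
print(restored == big)           # True
```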

Worked example

Take {id: 7, name: "Ada"}. As JSON it's the string {"id":7,"name":"Ada"}: 21 bytes, with the field names id and name along for the ride. As tagged binary it's just 08 07 12 03 41 64 61, which is 7 bytes: tag+value for id, then tag+length+UTF-8 for name.

Now the schema grows a fourth field, email (field 4, optional string). Two directions, both fine:

New writer → old reader (forward compat). The new bytes include 22 06 61 40 78 2e 69 6f for email: tag 0x22 = (4<<3)|2, length 6, then "a@x.io". The old reader doesn't know field 4, sees wire type 2, reads the length (6), and skips 6 bytes. id and name decode normally.
Old writer → new reader (backward compat). The old bytes contain no field 4 at all. The new reader finishes the buffer, notices email was never present, and fills in its default "".

Both directions work because the tag numbers 1, 2, 3 never moved. Stable tags are the contract.
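
Both directions can be run end to end with a small schema-driven decoder. This is a sketch under stated assumptions: the schema is a plain dict of field number → (name, default), only wire types 0 and 2 are handled, booleans stay as 0/1, and decode and read_varint are illustrative names rather than any library's API.

```python
from __future__ import annotations


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode(buf: bytes, schema: dict[int, tuple[str, object]]) -> dict[str, object]:
    # Start from defaults, so fields absent from the bytes keep them.
    record = {name: default for name, default in schema.values()}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_no, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos : pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_no in schema:
            name, _ = schema[field_no]
            record[name] = value.decode("utf-8") if isinstance(value, bytes) else value
        # Unknown field number: the value was consumed above and is dropped.
    return record


OLD = {1: ("id", 0), 2: ("name", ""), 3: ("active", 0)}
NEW = {**OLD, 4: ("email", "")}

old_bytes = bytes.fromhex("08 07 12 03 41 64 61")
new_bytes = old_bytes + bytes.fromhex("22 06 61 40 78 2e 69 6f")

print(decode(new_bytes, OLD))  # {'id': 7, 'name': 'Ada', 'active': 0}
print(decode(old_bytes, NEW))  # {'id': 7, 'name': 'Ada', 'active': 0, 'email': ''}
```

Neither reader needed to know which version wrote the bytes. The field numbers alone were enough.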

Check yourself

You remove an old field from your protobuf schema, and later you want to add a different, unrelated field. To stay safe, what must you do with the removed field's tag number?

Why does a newly added field need to be optional rather than required?