Tokenizer version mismatch

A model only ever sees numbers — feed it numbers from the wrong dictionary and it answers fluently to a question you never asked.

The idea

A served model is paired with one specific tokenizer — a vocabulary that maps text to token IDs. The model learned what each ID means during training. If the serving stack loads a different tokenizer version, the same input text splits into different chunks and maps to different IDs.

The model then receives IDs that mean something else, or that fall outside its vocabulary entirely. Nothing crashes. There is no exception, no 500, no log line — just quietly wrong predictions. That silence is what makes this bug expensive.

Input text

Press play. The same text runs through both tokenizers — v1 is what the model trained on, v2 is what the serving stack loaded.

How it works

Tokenization is deterministic given a vocabulary. The model's embedding table has one row per ID, learned during training. Token ID 1899 only means "happi" because the model saw 1899 wherever "happi" appeared in training. Swap the vocabulary and the same byte string resolves to different IDs — pointing the model at rows that encode something else, or at an out-of-vocabulary marker it barely understands.

So the model and its tokenizer must load as one version-pinned pair:

# Pin the pair. Same revision for both — never load them separately.
REV = "model-2026-06-12"
tok   = Tokenizer.load(f"artifacts/{REV}/tokenizer")   # vocab.json + merges
model = Model.load(f"artifacts/{REV}/weights")

assert tok.vocab_hash == model.expected_vocab_hash, "tokenizer/model mismatch"

text = "unhappiness"
print(tok_v1.encode(text))   # [412, 1899, 305]    <- what the model learned
print(tok_v2.encode(text))   # [412, 7340, 0]      <- 0 is [UNK]; row 7340 means something else
# Same text. Same model. Different numbers in -> different meaning -> wrong answer, no error.

The guard that saves you is the assert: a vocab hash (or revision tag) checked at load time, so a mismatch fails loudly at startup instead of silently at inference.

Signals & trade-offs

Signal	What it means	Action
Quality drop, zero errors	Inference still returns — just wrong. The classic silent mismatch.	Diff served token IDs against the training tokenizer on a fixed sample.
Offline eval good, online bad	Eval used the right tokenizer; serving loaded another.	Round-trip the deployed tokenizer through the eval harness.
Mixed results across replicas	A rolling deploy left v1 and v2 tokenizers split across the fleet.	Pin the tokenizer to the model revision; redeploy atomically.
Spike in [UNK] / OOV rate	Text is hitting tokens the loaded vocab can't represent.	Alert on the OOV ratio per request; compare to the training baseline.

Watch out for

Separately versioned artifacts. Model weights in one bucket, vocab.json in another, deployed on different cadences. They drift, and nothing connects them.
Adding special tokens. Inserting [CLS] or a new domain token shifts every ID after it. The text looks identical; the numbers all move.
Rolling-deploy split-brain. Mid-rollout, some replicas serve v1's tokenizer and some serve v2's. The same request gets different answers depending on which pod it lands on.
Silent out-of-vocabulary. An unknown chunk maps to [UNK] (often ID 0). No crash — the model just gets a near-meaningless token and shrugs.
No round-trip integration test. Nothing in CI loads the exact served tokenizer, encodes a fixture, and asserts the IDs match the model's expected vocabulary.

Worked example

A sentiment service ships fine for months. A data engineer updates vocab.json to add three domain tokens for a new product line and redeploys only the tokenizer artifact — the model weights are untouched. Adding those tokens shifts every later ID, so "unhappiness" now encodes to [412, 7340, 0] instead of the trained [412, 1899, 305]: a shifted middle token and an [UNK] tail. Live accuracy sags from 91% to 64%, yet every request returns 200 OK and offline eval (which still uses the old tokenizer) looks perfect. On-call spends a day chasing the model before someone diffs the served token IDs against training and finds the tokenizer was bumped without retraining.

Check yourself

Online accuracy dropped sharply, but there are no errors and offline eval still looks great. What is the most likely cause?