A model only ever sees numbers — feed it numbers from the wrong dictionary and it answers fluently to a question you never asked.
A served model is paired with one specific tokenizer — a vocabulary that maps text to token IDs. The model learned what each ID means during training. If the serving stack loads a different tokenizer version, the same input text splits into different chunks and maps to different IDs.
The model then receives IDs that mean something else, or that fall outside its vocabulary entirely. Nothing crashes. There is no exception, no 500, no log line — just quietly wrong predictions. That silence is what makes this bug expensive.
Tokenization is deterministic given a vocabulary. The model's embedding table has one row per ID, learned during training. Token ID 1899 only means "happi" because the model saw 1899 wherever "happi" appeared in training. Swap the vocabulary and the same byte string resolves to different IDs — pointing the model at rows that encode something else, or at an out-of-vocabulary marker it barely understands.
So the model and its tokenizer must load as one version-pinned pair:
# Pin the pair. Same revision for both — never load them separately.
REV = "model-2026-06-12"
tok = Tokenizer.load(f"artifacts/{REV}/tokenizer") # vocab.json + merges
model = Model.load(f"artifacts/{REV}/weights")
assert tok.vocab_hash == model.expected_vocab_hash, "tokenizer/model mismatch"
text = "unhappiness"
print(tok_v1.encode(text)) # [412, 1899, 305] <- what the model learned
print(tok_v2.encode(text)) # [412, 7340, 0] <- 0 is [UNK]; row 7340 means something else
# Same text. Same model. Different numbers in -> different meaning -> wrong answer, no error.
The guard that saves you is the assert: a vocab hash (or revision tag) checked at load time, so a mismatch fails loudly at startup instead of silently at inference.
| Signal | What it means | Action |
|---|---|---|
| Quality drop, zero errors | Inference still returns — just wrong. The classic silent mismatch. | Diff served token IDs against the training tokenizer on a fixed sample. |
| Offline eval good, online bad | Eval used the right tokenizer; serving loaded another. | Round-trip the deployed tokenizer through the eval harness. |
| Mixed results across replicas | A rolling deploy left v1 and v2 tokenizers split across the fleet. | Pin the tokenizer to the model revision; redeploy atomically. |
| Spike in [UNK] / OOV rate | Text is hitting tokens the loaded vocab can't represent. | Alert on the OOV ratio per request; compare to the training baseline. |
vocab.json in another, deployed on different cadences. They drift, and nothing connects them.[CLS] or a new domain token shifts every ID after it. The text looks identical; the numbers all move.[UNK] (often ID 0). No crash — the model just gets a near-meaningless token and shrugs.A sentiment service ships fine for months. A data engineer updates vocab.json to add three domain tokens for a new product line and redeploys only the tokenizer artifact — the model weights are untouched. Adding those tokens shifts every later ID, so "unhappiness" now encodes to [412, 7340, 0] instead of the trained [412, 1899, 305]: a shifted middle token and an [UNK] tail. Live accuracy sags from 91% to 64%, yet every request returns 200 OK and offline eval (which still uses the old tokenizer) looks perfect. On-call spends a day chasing the model before someone diffs the served token IDs against training and finds the tokenizer was bumped without retraining.
Online accuracy dropped sharply, but there are no errors and offline eval still looks great. What is the most likely cause?