Question
After a routine dependency bump, the accuracy of your NLP intent-classification service quietly drops: support routing mislabels a noticeably higher share of tickets, even though the service returns 200s with normal latency and confidence scores still look 'high'. Dashboards show that a library upgrade in the serving image bumped the tokenizer package by a minor version, while the model was trained and exported against the previous tokenizer version. Spot-checks show the new tokenizer splits certain tokens (numbers, hyphenated words, emoji) differently, so the token IDs fed to the model differ from those seen in training. The model file itself is unchanged. How do you triage and respond?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
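
To turn the spot-checks into a repeatable signal, one option is a tokenizer parity check: encode a fixed set of sample tickets with both tokenizer versions and report where the token IDs diverge. The sketch below is illustrative only. `old_tokenize`, `new_tokenize`, the toy `VOCAB`, and the sample texts are all hypothetical stand-ins, not the real tokenizer package or its API; in practice you would call the pinned training-time tokenizer and the one in the serving image.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stand-ins for the two tokenizer versions (assumption:
# the real ones would be loaded from the training and serving images).
def old_tokenize(text: str) -> List[str]:
    # Splits on whitespace only, so "wi-fi" stays one token.
    return text.lower().split()

def new_tokenize(text: str) -> List[str]:
    # Splits punctuation into separate tokens, so "wi-fi" becomes 3 tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Toy vocabulary fixed at training time; ID 0 is the unknown token.
VOCAB: Dict[str, int] = {
    "<unk>": 0, "reset": 1, "my": 2, "wi-fi": 3,
    "order": 4, "12345": 5, "late": 6,
}

def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    # Map tokens to IDs, falling back to the unknown ID.
    return [vocab.get(tok, 0) for tok in tokens]

def parity_report(
    samples: List[str],
    tok_a: Callable[[str], List[str]],
    tok_b: Callable[[str], List[str]],
    vocab: Dict[str, int],
) -> dict:
    # Collect every sample whose token IDs differ between the two tokenizers.
    mismatches = [
        text for text in samples
        if encode(tok_a(text), vocab) != encode(tok_b(text), vocab)
    ]
    return {
        "total": len(samples),
        "mismatched": len(mismatches),
        "examples": mismatches,
    }

samples = ["reset my wi-fi", "order 12345 late"]
report = parity_report(samples, old_tokenize, new_tokenize, VOCAB)
print(report)
```

A check like this can support triage (confirming that the tokenizer change, not the model, alters the inputs) and prevention (failing a build when the mismatch count on a golden set is nonzero). It does not replace the first step, which is still to mitigate, for example by reverting to the previous serving image if that is available.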