According to Semianalysis:
Has anyone tried to test this hypothesis with the glitch token magic?
Tokenizers are often reused across multiple generations of a model, or at least that was the case a couple of years ago, so I wouldn't expect it to work well as a test.
The same tokenizer applied to different training data produces different glitch tokens; see e.g. the comparison of Llama-family models in Yuxi Li et al. 2024, https://arxiv.org/abs/2404.09894
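For anyone who wants to try the probe themselves, here is a minimal sketch of a glitch-token test using the Hugging Face `transformers` API. The model name, prompt wording, and the decision rule (a token counts as "glitchy" if the model can't echo it back) are all my own assumptions for illustration, not the exact procedure from the paper above.

```python
# Minimal glitch-token probe sketch (assumptions: HF causal LM, echo-based test).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical choice; any open causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def looks_glitchy(token_str: str) -> bool:
    """Ask the model to repeat a token verbatim; failure to echo it is a glitch signal."""
    prompt = f'Please repeat the string "{token_str}" back to me.'
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    reply = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return token_str not in reply

# Probe a small subsample of the vocabulary (the full vocab would take a while).
candidates = [tok.convert_tokens_to_string([t]) for t in list(tok.get_vocab())[:200]]
glitch_tokens = [t for t in candidates if looks_glitchy(t)]
print(glitch_tokens)
```

Running the same probe on two models that share a tokenizer and comparing the resulting sets would be one way to check the claim that glitch tokens track the training run rather than the tokenizer.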
Good point! I hadn’t quite realized that, although it seems obvious in retrospect.