If this is the case, isn’t the most straightforward test to take a pretrained open-source LLM (e.g. Gemma 3, GPT-OSS, Llama, etc.) and check its output logits? For example, the completion for “The capital of France is” should be “Paris”, so either we see the entire word token or just “P”, and we can check their relative likelihoods.
I feel like in reality this doesn’t happen (at least when I’ve had occasion to check logits), because the difference between a single letter and a whole word is that the latter “collapses” the space of possible continuations more definitively. Also, it wouldn’t be too hard to add a small regularization term that rewards choosing longer tokens when possible (though I don’t know whether any models are actually trained that way).
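For concreteness, a minimal version of that logit check with Hugging Face transformers might look like the sketch below. The model name is just a stand-in, and the exact token strings depend on the vocabulary (most BPE vocabs store words with a leading space, e.g. “ Paris”):

```python
# Sketch only: model name and candidate strings are assumptions; adjust for
# the tokenizer you actually use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any open-weights model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits
probs = torch.softmax(logits, dim=-1)

# Compare the whole-word token against the single-letter prefix token.
for candidate in [" Paris", " P"]:
    ids = tok.encode(candidate)
    if len(ids) == 1:  # the comparison is only meaningful if it's one token
        print(f"{candidate!r}: p = {probs[ids[0]].item():.4f}")
    else:
        print(f"{candidate!r} is not a single token in this vocab: {ids}")
```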
I think I was explaining this in a confusing way, so I added another footnote. Does this help?
The idea is that if the model is trained to use shorter tokens (as it is in many cases, like non-, un-, anti-, etc.), it will be biased to use those tokens more than it should. So in the Paris case, I would expect most LLMs with a reasonable vocab size to have a “Paris” token, and they wouldn’t be trained to use “P”.
Based on my experiments above, I think that if you did force “P” to be tokenized separately, the probability of “P” would be higher and the model would be biased to respond with capital names starting with P.
I’m probably also misunderstanding, but wouldn’t this predict that large production models prefer words starting with “a” and names starting with “I” (capital “i”)? These letters are, simultaneously, frequently used words in English, which makes it likely that the tokenizer includes the tokens “ a” and “ I” and that the model is incentivized to use them.
It depends on the situation. LLMs are usually trained on the tokenization that uses the longest tokens for any particular output, so words starting with “a” will usually get a tokenization longer than just “a”. For example, “and” is its own token, so a model usually wouldn’t be trained to output [a, nd]. Also, it’s not just how common a token is, but how common it is in that particular situation: “a” and “I” being common words doesn’t mean they’re common prefixes of other words (although this bias might affect “a”, because a lot of phrases start with “a …”).
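As a quick illustration, you can check how a BPE tokenizer actually splits these words. A minimal sketch assuming a GPT-2-style tokenizer (other vocabularies will differ):

```python
# Sketch, assuming a GPT-2-style BPE vocabulary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

for text in [" and", " a", " I"]:
    ids = tok.encode(text)
    pieces = [tok.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")
# Typically " and" comes out as a single token rather than [" a", "nd"],
# so the model is never trained to emit " a" when the target word is "and".
```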
In cases where a model is trying to output a name that isn’t a full-word token, I think it would be biased[1] to pick a name starting with a token that’s the prefix of a lot of names.
Note that the bias noticeably affects the output distribution, but it isn’t overwhelming. Even a huge change to the tokenizer only took the fraction of words starting with “c” from 4% to 10%.
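For reference, a first-letter distribution like this can be estimated by sampling completions and tallying initial letters. A rough sketch of that kind of measurement (the prompt, model, and sampling settings here are placeholders, not the setup behind the 4%/10% figures above):

```python
# Rough sketch of tallying first letters of sampled completions.
# Prompt, model, and sampling settings are placeholders.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Name a city:"
inputs = tok(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

counts = Counter()
for _ in range(200):
    out = model.generate(**inputs, do_sample=True, max_new_tokens=5,
                         pad_token_id=tok.eos_token_id)
    completion = tok.decode(out[0, prompt_len:]).strip()
    if completion:
        counts[completion[0].lower()] += 1

total = sum(counts.values())
for letter, n in counts.most_common():
    print(f"{letter}: {n / total:.1%}")
```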