I’m probably also misunderstanding, but wouldn’t this predict that large production models prefer words starting with “a” and names starting with “I” (capital “i”)? These letters are also frequently-used words in English, which makes it likely that the tokenizer includes the tokens “ a” and “ I” and that the model is incentivized to use them.
It depends on the situation. LLMs are usually trained on tokenizations that use the longest available token for any given span of text, so a word starting with “a” will usually be tokenized as something longer than just “ a”. For example, “and” is its own token, so a model usually wouldn’t be trained to output [a, nd]. Also, what matters is not just how common a token is overall, but how common it is in that particular situation: “a” and “I” being common words doesn’t mean they’re common prefixes of other words (although this bias might still affect “a”, because a lot of phrases start with “a …”).
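For concreteness, here’s a minimal sketch using tiktoken’s cl100k_base encoding; the specific splits are only illustrative, since they depend on whichever tokenizer a given production model actually uses:

```python
import tiktoken

# cl100k_base is an assumption here, standing in for a production tokenizer.
enc = tiktoken.get_encoding("cl100k_base")

for text in [" a", " and", " I", " antidisestablishmentarianism"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")

# Common whole words like " and" come back as a single token, so the training
# target is essentially never the split [" a", "nd"]; the short " a" token
# mostly shows up when "a" really is the whole word.
```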
In cases where a model is trying to output a name that isn’t a full-word token, I think it would be biased[1] to pick a name starting with a token that’s the prefix of a lot of names.
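As a rough illustration of that effect (the name list and the grouping by first token are purely hypothetical, not anything from the post):

```python
from collections import Counter

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # again, just a stand-in tokenizer

# Hypothetical name list, purely for illustration.
names = ["Isabella", "Isaac", "Isaiah", "Isadora", "Charlotte", "Chandra", "Quentin"]

# Group names by the first token of " <Name>" (leading space, as mid-sentence).
first_token_counts = Counter(enc.decode([enc.encode(" " + name)[0]]) for name in names)
print(first_token_counts.most_common())

# If several names share the same first token, a bias toward that token
# favors the whole group of names at once.
```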
Note that this bias noticeably affects the output distribution, but it’s not overwhelming: even a huge change to the tokenizer only took the fraction of words starting with “c” from 4% to 10%.