It depends on the situation. LLMs are usually trained on text tokenized with the longest token that fits at each position, so a word starting with “a” will usually get a first token longer than just “a”. For example, “and” is its own token, so a model usually wouldn’t be trained to output [a, nd]. Also, it’s not just how common a token is overall, but how common it is in that particular situation: “a” and “I” being common words doesn’t mean they’re common prefixes of other words (although this bias might still affect “a”, since a lot of phrases start with “a …”).
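To make the “and” example concrete, here’s a minimal sketch using the GPT-2 tokenizer via the tiktoken library (any BPE tokenizer would illustrate the same point, though the exact token IDs differ): the canonical encoding of “ and” is a single token, so a split like [a, nd] decodes to the same text but essentially never appears in the training data.

```python
# Minimal sketch with the GPT-2 BPE tokenizer (tiktoken).
# Exact token IDs are tokenizer-specific; other tokenizers will differ.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

canonical = enc.encode(" and")                     # how " and" appears in training data
alternative = enc.encode(" a") + enc.encode("nd")  # a valid but non-canonical split

print("canonical:  ", canonical, [enc.decode([i]) for i in canonical])
print("alternative:", alternative, [enc.decode([i]) for i in alternative])
print("same decoded text:", enc.decode(canonical) == enc.decode(alternative))

# The canonical encoding here is a single token, so during training the model
# essentially never sees " and" written as [" a", ...]; it learns to emit the
# longer token directly.
```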
In cases where a model is trying to output a name that isn’t a full-word token, I think it would be biased[1] toward picking a name starting with a token that’s a prefix of a lot of names.
Note that this bias noticeably affects the output distribution, but it’s not overwhelming: even a huge change to the tokenizer only took the fraction of words starting with “c” from 4% to 10%.