Shorter common tokens are (correctly) learned to be higher-probability because they have the combined probability of any word they could complete.
I don’t think this is right. LLMs are trained with teacher forcing on a fixed tokenization. For common words the tokenizer provides a single long token (e.g., “ the”, “ and”, “ cat”). The model is trained to put probability on that token—not on a shorter prefix like “ c”. So a short token does not “inherit the combined probability of any word it could complete.” Those completions were usually seen as longer tokens during training.
You can check this by running inference on an LM: if you ask the model to complete “hello”, it will put “ world” above the bare space “ ”, even though both are typically single tokens.
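Here’s a minimal sketch of that check using GPT-2 via Hugging Face transformers (the model choice and the exact code are my own illustration, not something from the post):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("hello", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Distribution over the token that follows "hello".
next_probs = torch.softmax(logits[0, -1], dim=-1)

# In GPT-2's vocabulary, " world" and a bare space " " are each single tokens.
world_id = tokenizer.encode(" world")[0]
space_id = tokenizer.encode(" ")[0]

print(f'P(" world") = {next_probs[world_id].item():.4f}')
print(f'P(" ")      = {next_probs[space_id].item():.4f}')
```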
On a quick skim, I think that the rest of your arguments are plausible.
You can see in the experiment section that I’m not saying LLMs normally do this with words like “cat”. I actually changed the tokenizer to force that situation, to show that the effect would appear the way I expected.
I was a little imprecise about shorter tokens, but I’m talking about cases where the tokenizer causes some set of words or phrases to be canonically tokenized with shorter tokens. So, if the tokenizer decides that the un- or non- prefix should be a separate token, that shorter prefix token will be more likely than it would be if the tokenizer had a longer whole-word token for every un- and non- word.
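As a quick way to see how a given tokenizer actually splits prefixed words (GPT-2’s tokenizer is just a convenient stand-in here, and I’m not claiming anything about the exact splits it happens to produce):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Print whatever pieces the tokenizer produces for some un-/non- words.
for word in [" unhappy", " unhelpful", " nonstandard", " nonsense"]:
    pieces = tokenizer.convert_ids_to_tokens(tokenizer.encode(word))
    print(f"{word!r} -> {pieces}")

# If a word is split as [" un", "happy"], the " un" token is the training
# target whenever any such un- word follows, so it picks up the combined
# probability of those continuations. If the word instead has its own
# whole-word token, the prefix token is never the target there and gets
# none of that mass.
```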
You’re right that I should probably explain this better in the intro though.