You can see in the experiment section that I’m not saying LLMs normally do this with words like “cat”. I actually changed the tokenizer to force that situation and show that the effect appears the way I expected.
I was a little imprecise about shorter tokens; what I mean is the case where the tokenizer causes some set of words or phrases to be canonically tokenized into shorter tokens. So, if the tokenizer decides that the un- or non- prefix should be its own token, that shorter prefix token will be more likely than any of the corresponding words would be if the tokenizer instead had a longer, whole-word token for every un- and non- word.
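To make that concrete, here’s a toy sketch (hypothetical vocabulary and made-up probabilities, not from any real model or tokenizer): when every un- word is its own token, the probability mass is spread across them, but if the tokenizer splits off a shared “un” prefix token, that single token inherits the summed mass of all the un- continuations, so it looks more likely at that position.

```python
# Toy illustration (hypothetical vocab and probabilities, not a real model).
# With whole-word tokens, each un- word gets its own slice of probability.
long_token_probs = {
    "unhappy": 0.04,
    "unlikely": 0.03,
    "unusual": 0.02,
    "cat": 0.05,  # an unrelated single-token word for comparison
}

# Same underlying word probabilities, but now every un- word is tokenized
# as "un" + remainder, so the first-token distribution merges them.
short_token_probs = {
    "un": long_token_probs["unhappy"]
          + long_token_probs["unlikely"]
          + long_token_probs["unusual"],  # 0.09
    "cat": long_token_probs["cat"],        # 0.05
}

print(short_token_probs["un"])  # 0.09 -- higher than any single un- word
print(max(p for w, p in long_token_probs.items() if w != "cat"))  # 0.04
```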
You’re right that I should probably explain this better in the intro though.