So, I had a hypothesis last night that training on a different scoring rule might solve this problem (because it could encourage uncertain probabilities to be lower, and thus it would be easier to filter them out without making short tokens undeservedly more likely).
I ended up forking your code, and this morning I trained an LLM on the Shakespeare dataset using the α-ReLU loss (from the Speeding Up Entmax paper). The α-ReLU loss is a proper scoring rule based on the Tsallis entropy.
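For context, here is a minimal sketch of the Tsallis entropy this family of losses is built on, using the textbook definition (the entmax papers use a rescaled variant of it, if I recall correctly); as α → 1 it recovers the Shannon entropy:

```python
import numpy as np

def tsallis_entropy(p, alpha=1.5):
    # Textbook Tsallis entropy: S_alpha(p) = (1 - sum_i p_i**alpha) / (alpha - 1).
    # As alpha -> 1 this tends to the Shannon entropy -sum_i p_i * log(p_i).
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability entries contribute nothing
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float((1.0 - np.sum(p ** alpha)) / (alpha - 1.0))

print(tsallis_entropy([0.7, 0.2, 0.1], alpha=1.5))  # more peaked distributions score lower
```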
My results were the following:
| Letter | Scoring Rule | Top-K | Control | Test (T=1.0) | Test (T=0.8) | Test (T=0.1) |
| --- | --- | --- | --- | --- | --- | --- |
| c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26% |
| c | α-ReLU | 200 | 3.41% | 4.66% | 4.75% | 4.33% |
| c | α-ReLU | ∞ | 3.41% | 3.87% | 4.00% | 3.49% |
All models were trained for 1500 iterations. The control was trained without the special C-tokenizer, while the test models were trained with it. In the paper, α-ReLU is parameterized by α, which controls the Tsallis exponent, and τ, which is a constant shift in the ReLU. I set α = 1.5 and τ = 0 for all experiments (the defaults in the α-ReLU Python library).
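In case anyone wants to poke at the mapping itself, here is a rough sketch of how I understand the α-ReLU transform: the normalizing threshold of α-entmax is replaced by the fixed shift τ, so each entry can be computed independently, at the cost of the output no longer being guaranteed to sum to 1. The exact way the library folds the (α − 1) factor and τ into the formula is my assumption, not its actual code:

```python
import torch

def alpha_relu(logits, alpha=1.5, tau=0.0):
    # Entmax-style mapping with a *fixed* shift tau instead of the normalizing
    # threshold: p_i = relu((alpha - 1) * z_i - tau) ** (1 / (alpha - 1)).
    # With alpha = 1.5 and tau = 0 this reduces to 0.25 * relu(z) ** 2.
    # NOTE: this parameterization is my reading of the construction, not the
    # library's exact code; the output is sparse (exact zeros) and unnormalized.
    return torch.relu((alpha - 1.0) * logits - tau) ** (1.0 / (alpha - 1.0))

z = torch.tensor([2.0, 0.5, -1.0, 0.1])
print(alpha_relu(z))  # sufficiently negative logits get weight exactly 0
```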
The α-ReLU-trained network does NOT seem to exhibit the same trend of lower temperatures leading to a higher probability of words starting with “c”. And when top-k is set to ∞, most of the difference between it and the control disappears as well.
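For reference, sampling with temperature and top-k in nanoGPT looks roughly like the sketch below; passing top_k=None corresponds to the ∞ rows above (for the α-ReLU runs the softmax step would be replaced by the transform, so treat this as the cross-entropy/control path only):

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=None):
    # nanoGPT-style sampling (roughly): scale logits by temperature,
    # optionally keep only the top_k largest logits, then sample from softmax.
    logits = logits / temperature
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits = logits.masked_fill(logits < v[..., -1:], float('-inf'))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```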
So, perhaps we should just be training with a different scoring rule!
You can find my fork at https://github.com/cooljoseph1/nanoGPT-tokenizer-experiment/.

EDIT: I realized I was calculating temperature unfairly for the α-ReLU sampling. After fixing it, the α-ReLU network is actually worse than the cross-entropy-trained one:
Thanks for trying this! I wonder if this is making things worse in a similar way to top-k. The C-tokenizer makes it very likely that “c” is always in the top 200 tokens. I wonder if it’s also ensuring that “c” is rarely uncertain enough to be filtered out by this scoring rule?