So, I had a hypothesis last night that training on a different scoring rule might solve this problem (because it could encourage uncertain probabilities to be lower, and thus it would be easier to filter them out without making short tokens undeservedly more likely).
I ended up forking your code, and this morning I trained an LLM on the Shakespeare dataset using the α-ReLU loss (from the Speeding Up Entmax paper). The α-ReLU loss is a proper scoring rule based on the Tsallis entropy.
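For context, here is a minimal sketch of the Tsallis entropy this family of losses is built on, using the textbook definition (the entmax papers use a rescaled variant of it, if I recall correctly); as α → 1 it recovers the Shannon entropy:

```python
import numpy as np

def tsallis_entropy(p, alpha=1.5):
    # Textbook Tsallis entropy: S_alpha(p) = (1 - sum_i p_i**alpha) / (alpha - 1).
    # As alpha -> 1 this tends to the Shannon entropy -sum_i p_i * log(p_i).
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability entries contribute nothing
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float((1.0 - np.sum(p ** alpha)) / (alpha - 1.0))

print(tsallis_entropy([0.7, 0.2, 0.1], alpha=1.5))  # more peaked distributions score lower
```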
My results were the following:
| Letter | Scoring Rule | Top-K | Control | Test (T=1.0) | Test (T=0.8) | Test (T=0.1) |
| --- | --- | --- | --- | --- | --- | --- |
| c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26% |
| c | α-ReLU | 200 | 3.41% | 4.66% | 4.75% | 4.33% |
| c | α-ReLU | ∞ | 3.41% | 3.87% | 4.00% | 3.49% |
All models were trained for 1500 iterations. The control was trained without the special C-tokenizer, while the test models were trained with it. In the paper, α-ReLU is parameterized by α, which controls the Tsallis exponent, and τ, which is a constant shift in the ReLU. I set α = 1.5 and τ = 0 for all experiments (the defaults in the α-ReLU Python library).
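In case anyone wants to poke at the mapping itself, here is a rough sketch of how I understand the α-ReLU transform: the normalizing threshold of α-entmax is replaced by the fixed shift τ, so each entry can be computed independently, at the cost of the output no longer being guaranteed to sum to 1. The exact way the library folds the (α − 1) factor and τ into the formula is my assumption, not its actual code:

```python
import torch

def alpha_relu(logits, alpha=1.5, tau=0.0):
    # Entmax-style mapping with a *fixed* shift tau instead of the normalizing
    # threshold: p_i = relu((alpha - 1) * z_i - tau) ** (1 / (alpha - 1)).
    # With alpha = 1.5 and tau = 0 this reduces to 0.25 * relu(z) ** 2.
    # NOTE: this parameterization is my reading of the construction, not the
    # library's exact code; the output is sparse (exact zeros) and unnormalized.
    return torch.relu((alpha - 1.0) * logits - tau) ** (1.0 / (alpha - 1.0))

z = torch.tensor([2.0, 0.5, -1.0, 0.1])
print(alpha_relu(z))  # sufficiently negative logits get weight exactly 0
```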
The α-ReLU-trained network does NOT seem to exhibit the same trend of lower temperatures leading to a higher probability of words starting with “c”. And when top-k is set to ∞, most of the difference between it and the control disappears as well.
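For reference, sampling with temperature and top-k in nanoGPT looks roughly like the sketch below; passing top_k=None corresponds to the ∞ rows above (for the α-ReLU runs the softmax step would be replaced by the transform, so treat this as the cross-entropy/control path only):

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=None):
    # nanoGPT-style sampling (roughly): scale logits by temperature,
    # optionally keep only the top_k largest logits, then sample from softmax.
    logits = logits / temperature
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits = logits.masked_fill(logits < v[..., -1:], float('-inf'))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```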
So, perhaps we should just be training with a different scoring rule!
You can find my fork at https://github.com/cooljoseph1/nanoGPT-tokenizer-experiment/.

EDIT: I realized I was calculating temperature unfairly for the α-ReLU sampling. After fixing it, the α-ReLU network is actually worse than the cross-entropy-trained one:
Thanks for trying this! I wonder if this is making things worse in a similar way to top-k. The C-tokenizer makes it very likely that “c” is always in the top 200 tokens. I wonder if it’s also ensuring that “c” is rarely uncertain enough to be filtered out by this scoring rule?