Shouldn’t this be generally “likely tokens are even more likely”? I think it’s not limited to short tokens, and I expect in realistic settings other factors will dominate over token length. But I agree that top-k (or top-p) sampling should lead to a miscalibration of LLM outputs in the low-probability tail.
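To make the miscalibration concrete, here is a minimal sketch of top-p truncation on a made-up next-token distribution (the probabilities below are illustrative only, not from the post):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Keep the smallest set of tokens whose mass reaches p, then renormalize."""
    order = np.argsort(probs)[::-1]          # token ids sorted by probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()         # renormalize the surviving mass

# Toy next-token distribution: one dominant token plus a long tail.
probs = np.array([0.50, 0.20, 0.10, 0.08, 0.06, 0.04, 0.02])
print(top_p_filter(probs, p=0.9))
# Tail tokens drop to exactly 0 and every surviving token is scaled up,
# so the sampled distribution is miscalibrated relative to the model's.
```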
I suspect this has something to do with “LLM style”. LLMs may be pushed to select “slop” words because those words have more possible endings, even if none of those endings are the best one.
My intuition is that LLM style predominantly comes from post-training (promoting maximally inoffensive answers, etc.) rather than from top-k/top-p sampling. (I would bet that if you sampled DeepSeek or GPT-OSS with k = infinity you wouldn’t notice a systematic reduction in “LLM style”, but I’d be keen to see the experiment.)
Thanks for the writeup, I appreciated the explanations and especially the Alice/Bob/Blake example!
Shouldn’t this be generally “likely tokens are even more likely”?
I thought focusing on short tokens would be interesting since “make likely tokens more likely” is just temperature scaling doing its job; the interaction with token length is what I find surprising.
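A toy sketch of that interaction (the names, probabilities, and tokenization below are made up for illustration): if some names share a prefix token, token-level truncation can promote them over a single-token name the model actually prefers, whereas temperature scaling at the word level only sharpens the distribution without reordering it.

```python
import math

# Assume "Alice" is a single token, while "Bob" and "Blake" both start with
# a shared prefix token "B" (hypothetical tokenization for illustration).
word_probs = {"Alice": 0.40, "Bob": 0.35, "Blake": 0.25}

# Step 1: the model's distribution over *first tokens*.
# "B" aggregates the mass of every name that starts with it.
first_token_probs = {"Alice": 0.40, "B": 0.35 + 0.25}

# Token-level truncation at the extreme (top-k = 1, i.e. greedy) picks "B",
# even though no single name starting with "B" beats "Alice". The decoded
# name is then always Bob or Blake:
induced_word_probs = {
    "Alice": 0.0,
    "Bob": 0.35 / (0.35 + 0.25),
    "Blake": 0.25 / (0.35 + 0.25),
}
print(induced_word_probs)  # Alice: 0.0, Bob: ~0.583, Blake: ~0.417

# For contrast, temperature scaling applied to the *word-level* distribution
# sharpens it but keeps "Alice" on top, so the reordering above comes from
# the tokenization interaction, not just from "likely things get more likely".
T = 0.5
scaled = {w: math.exp(math.log(p) / T) for w, p in word_probs.items()}
Z = sum(scaled.values())
print({w: round(v / Z, 3) for w, v in scaled.items()})
# Alice: 0.464, Bob: 0.355, Blake: 0.181
```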