I think this insight is really interesting! Especially the potential connection to LLMisms.
But I don’t really understand why you chose these experiments. It seems to me the things to check or prove are:

- current tokenizers do actually tokenize typical training data so that short tokens are more common
- current models do produce text that recapitulates this bias
- how much top-k sampling exacerbates this bias, as a function of k
- how this changes some typical completions
You do significantly more work to show the effect in a toy setting that may or may not bear on the real case. And I think the outcome of your experiments is clear before you run them, because the effect of top-k sampling on low- and high-probability tokens is not complicated (and is well explained by you in the post).
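To make the "not complicated" effect concrete, here is a minimal sketch (the token distribution is made up for illustration, not taken from any real model): top-k sampling discards all but the k most probable tokens and renormalizes, so whatever mass the dropped tokens held is handed to the already-likely survivors.

```python
def top_k_renormalize(probs, k):
    """Keep only the k most probable tokens and renormalize.

    Tokens outside the top k get probability 0, so the mass they
    held is redistributed to the surviving (already likely) tokens.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Hypothetical next-token distribution: one common short token,
# several rarer, longer ones.
probs = {"a": 0.5, "the": 0.2, "apple": 0.15, "antidote": 0.1, "aqueduct": 0.05}

filtered = top_k_renormalize(probs, k=2)
# "a" rises from 0.5 to 0.5/0.7 ≈ 0.714; the rare long tokens are cut entirely.
```

If short tokens tend to sit at the top of the distribution, every top-k step tilts sampling further toward them, which is the bias the post describes.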
Yeah, characterizing the impact on current models would definitely be interesting.
I think the toy models are still interesting: the impact of top-k and temperature is straightforward in one sense (both make likely tokens more likely), but LLMs are complicated, and it was possible that my theory about forcibly shortening tokens to trigger this wouldn’t have worked.
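The temperature half of that "both make likely tokens more likely" claim can be sketched the same way (toy logits chosen for illustration): dividing logits by a temperature below 1 sharpens the softmax toward the most probable token.

```python
import math

def apply_temperature(logits, t):
    """Softmax over logits scaled by 1/t.

    Lower temperature sharpens the distribution: the most likely
    token gains probability at the expense of the rest.
    """
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
cold = apply_temperature(logits, 0.5)
warm = apply_temperature(logits, 1.0)
# cold[0] > warm[0]: the likeliest token is likelier still at low temperature.
```

So both knobs push in the same direction; the open question the toy models probe is how strongly that compounds over a whole generated sequence.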
I was also surprised by how big the effect was (admittedly, with a really large change to the tokenizer).