I remember reading a paper about how aiming for a certain entropy per token made LLMs sound more human. I think it might have been this paper? This marginalization of lower-probability tokens might be the reason why: aiming for a certain entropy would promote those tokens more often than a fixed temperature would, while still avoiding "noisy" tokens.
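To make the comparison concrete, here's a minimal sketch of what entropy-targeted sampling could look like, assuming the simplest possible approach: per-token binary search for the temperature whose softmax distribution hits a target entropy. The function names and the bisection scheme are my own illustration, not necessarily the paper's method.

```python
import numpy as np

def entropy(probs):
    # Shannon entropy in nats, ignoring zero-probability entries
    p = probs[probs > 0]
    return -np.sum(p * np.log(p))

def softmax(logits, temperature):
    z = logits / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def temperature_for_target_entropy(logits, target_entropy, lo=0.05, hi=5.0, iters=30):
    """Bisect for the temperature whose softmax distribution has the
    requested entropy. Entropy rises monotonically with temperature,
    so plain bisection converges quickly."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if entropy(softmax(logits, mid)) < target_entropy:
            lo = mid   # distribution too peaked: raise the temperature
        else:
            hi = mid   # distribution too flat: lower the temperature
    return (lo + hi) / 2

# Example: one token position's logits, matched to a 2-nat entropy target
rng = np.random.default_rng(0)
logits = rng.normal(size=50) * 3
t = temperature_for_target_entropy(logits, target_entropy=2.0)
print(f"temperature {t:.3f} gives entropy {entropy(softmax(logits, t)):.3f} nats")
```

The intuition is that a fixed temperature applies the same sharpening everywhere, whereas solving for the temperature per token lets confident positions stay sharp and uncertain ones stay open, which is where the lower-probability tokens get their chance.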