Adam Jermyn comments on Language models seem to be much better than humans at next-token prediction

Adam Jermyn 15 Aug 2022 17:48 UTC
LW: 2 AF: 1
Playing the perplexity game had a big impact on my intuitions around language models, so thanks for making it! In particular, the fact that models are so much better at it than humans means we can’t really tell from behavior alone whether a model is genuinely trying to predict the next token. This is a problem for detecting inner alignment failures, because we can’t tell (outside of the training set) whether the model is actually optimizing for next-token prediction or for something that just looks (to us) like next-token prediction.