When people wrote text, they didn’t do it statelessly. Sure, they physically wrote one word at a time, but people accumulate their understanding as they write, and they go back and edit with some frequency.
If you took the writers of your favorite books and forced them to write another book, except this time they had to write it autoregressively, and after every token they output you waited a day and swapped in a different writer, the overall quality of that output would probably be garbage.
So I don’t think the test is wrong per se: it’s a fair measure showing that NNs are better than humans at stateless next-token prediction. But it’s also not true that stateless human next-token prediction is the benchmark you should be comparing against.
As was mentioned or hinted at in another comment here, but it bears explicit emphasis: the perplexity of the actual generation process of human text is just the perplexity of the actual distribution of generated human text, were the dataset of infinite size.
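To make that last point concrete, here is a toy sketch (my own illustration, not from the thread): a model's perplexity on text drawn from some true distribution p is exp of the cross-entropy H(p, q), and it bottoms out at exp(H(p)), the perplexity of the source distribution itself, exactly when the model q matches p. The tiny three-token "language" below is hypothetical.

```python
import math

def perplexity(true_p, model_q):
    """Perplexity of model q on text from distribution p: exp(H(p, q))."""
    cross_entropy = -sum(p * math.log(model_q[tok])
                         for tok, p in true_p.items() if p > 0)
    return math.exp(cross_entropy)

# Hypothetical three-token "language" as the true generating distribution.
p = {"the": 0.5, "cat": 0.3, "sat": 0.2}

q_perfect = dict(p)                    # model equals the true distribution
q_uniform = {tok: 1 / 3 for tok in p}  # model that knows nothing

print(perplexity(p, q_perfect))  # the floor: exp(H(p)), ~2.80
print(perplexity(p, q_uniform))  # 3.0, i.e. the vocabulary size
```

The point of the sketch: the floor on achievable perplexity is set by the distribution of the text that actually got written, not by how statefully or painfully the writers produced it.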