If I had to pick a claim to disagree with, it would be the one about the “true distribution”, but it’s less that I disagree with the claim, and more that I disagree with using this particular method for declaring whether AIs are superhuman at language modeling.
but worse than you at whole sequence prediction for Rohin articles, for reasons you seem to already accept.
But a model can never be better than me at this task! Why are we interested in it?
This really feels like the core issue: if you define a task as “sample accurately from distribution D”, and you define the distribution D as “whatever is produced by X”, then X is definitionally optimal at the task, and no AI system is ever going to be super-X.
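One standard way to see this, in the log-loss framing that perplexity is based on (my framing, not something taken from the post): for any model $q$ scored on data drawn from $D$,

$$\mathbb{E}_{x \sim D}[-\log q(x)] = H(D) + D_{\mathrm{KL}}(D \,\|\, q) \ge H(D),$$

with equality exactly when $q = D$. If $D$ is defined as “whatever X produces”, then X’s own distribution attains the minimum, and the best any AI system can do is tie.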
(Even once the AI system can fool humans into thinking its text is human-generated, there could still be something inhuman about it that the humans fail to pick up on, which means its generations are still near-zero probability under the true distribution.)
In the language modeling case it isn’t quite this perverse (because the distribution is “whatever is produced by all humans” whereas we’re only asking the AI to be better than one human), but it still seems pretty perverse.
I think a better evaluation of human ability at language modeling would be to (1) sample a sequence from the empirical dataset, (2) sample a sequence from a human, (3) evaluate how often these two are the same, and (4) do some math to turn this into a perplexity. This would have huge variance (obviously most of the time the sequences won’t be the same in step (3)), but you can reduce the variance by sampling continuations for a given prompt rather than sampling sequences unprompted, and reduce it further by looking just at sampling the next token given a prompt (at which point you are at the scheme used in this post).
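Here is a minimal sketch of the next-token version of that scheme, assuming we can repeatedly query a human for a sampled next token; `sample_from_human` and the final conversion to perplexity are placeholders I’m supplying, not anything specified above:

```python
import random

def estimate_human_perplexity(dataset, sample_from_human, n_trials=1000):
    """Rough estimate of a human's per-token perplexity via repeated sampling.

    `dataset` is a list of token sequences; `sample_from_human(prompt)` stands in
    for asking a human to produce a single next token given a prompt. Both are
    hypothetical placeholders for the scheme described above.
    """
    usable = [s for s in dataset if len(s) >= 2]     # need at least prompt + 1 token
    matches = 0
    for _ in range(n_trials):
        seq = random.choice(usable)
        split = random.randrange(1, len(seq))        # choose where the prompt ends
        prompt, true_next = seq[:split], seq[split]  # step (1): sample from the data
        guess = sample_from_human(prompt)            # step (2): sample from the human
        matches += int(guess == true_next)           # step (3): how often do they agree?

    # Step (4): the match rate estimates the human's average probability on the
    # true next token; 1/p is the perplexity of a predictor that put probability
    # p on the truth every time (a crude stand-in for "do some math").
    p = matches / n_trials
    return float("inf") if p == 0 else 1.0 / p
```

(The unprompted whole-sequence version is the same loop with entire sequences in place of single tokens, which is where the huge variance comes from.)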
I still don’t think this makes them worse than SOTA language models because the SOTA will also have ~0 probability on nearly all actual sequences.
On my view, the log probability of a whole sequence is just the sum of the log probs on next tokens, so presumably the result from this post implies that the language model will be way better than humans here as well (while still assigning very low probability to any given sequence, just because the sequence is long, similarly to how you would assign very low probability to any given long sequence of coin flips).
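Spelling out the decomposition I have in mind here (just the chain rule, not anything new from the post):

$$\log P(x_{1:T}) = \sum_{t=1}^{T} \log P(x_t \mid x_{<t}),$$

so whichever predictor does better on average per-token log probability also assigns higher log probability to whole sequences, even though both numbers are astronomically small in absolute terms: a perfectly calibrated predictor of 1,000 fair coin flips still gives any particular outcome probability $2^{-1000}$.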
However, I’m still not sure how exactly you are thinking of “the probability assigned by the human”. It seems like you don’t like the methodology of this post, but if not that, I don’t see how else you would elicit a probability distribution over tokens / words / sequences from humans, and so I’m not really sure how to ground this claim.
(Whereas I think I do understand your claims about what kinds of text humans tend to generate.)