The problem is actually much worse than that. This setup doesn’t even remotely come close to testing true human next-token prediction ability (which is very likely at least on par with current LM ability).
It’s like trying to test the ability of the human visual cortex to predict distributions over near-future visual patterns by devising some complicated multiple-choice symbolic shape-prediction game and then evaluating humans on this computer game few-shot, after very little training.
The human brain internally has powerful sensory-stream prediction capabilities which underlie much of its strong, fast generalization performance on downstream tasks, just like ML systems. Internally, the human brain is trained on something very similar to token-sequence prediction, but that does not mean you can evaluate true brain-module performance through some complex high-level reasoning game that happens to nominally also involve the same token-sequence prediction.
Imagine hooking GPT-3 up to a robot body with a vision system, adding some new motor and vision modules, etc., and then asking it to play this very same computer game without any training, and then declaring that GPT-3 in fact had poor next-token prediction ability. It would take significant additional training for the high-level reasoning and motor systems to learn the value of the GPT-3 module, how to decode its outputs, wire it all up, etc. This doesn’t actually measure the desired capability at all.
I think this problem would be solved with additional training of the humans. If a human spent years (months? days?) practicing, they would learn to “use the force” and make use of their abilities. (Analogous to how we learn to ride bikes intuitively, or type, or read words effortlessly instead of by carefully looking at the letters and sounding them out.) Literally, the neurons in your brain would rewire to connect the text-prediction parts to the playing-this-game parts.
Of course, but that’s a very expensive experiment. A much cheaper and actually useful comparison experiment would be to train a multimodal AI on the exact same audiovisual prediction game the human is playing, and then compare their perplexity / complexity (flops or params) training curves.
An even cheaper version would be to have the humans spend a few days practicing, measure how much performance improvement came from that, and then extrapolate. Use a log scale of practice time, and see how much performance each doubling of practice time gets you...
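A minimal sketch of that extrapolation, with made-up accuracy numbers and a naive log-linear fit (purely illustrative, not real data):

```python
import numpy as np

# Hypothetical top-1 accuracy on the token-prediction game after
# successive doublings of practice time (made-up numbers).
practice_hours = np.array([1, 2, 4, 8, 16, 32])
accuracy = np.array([0.28, 0.31, 0.33, 0.36, 0.38, 0.41])

# Fit accuracy as a linear function of log2(practice time), i.e. estimate
# the average gain per doubling of practice.
gain_per_doubling, intercept = np.polyfit(np.log2(practice_hours), accuracy, 1)
print(f"~{gain_per_doubling:.3f} accuracy gained per doubling of practice")

# Naively extrapolate to much longer practice, e.g. ~1000 hours.
projected = intercept + gain_per_doubling * np.log2(1000)
print(f"projected accuracy at 1000 hours: {projected:.2f}")
```

Whether the trend stays log-linear that far out is exactly what the cheap version can’t tell you, which is why it’s only a proxy for the full training experiment.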
Yeah, that would be interesting.
Wait a minute, hold on: I disagree that this is a problem at all… isn’t it a fully general counterargument to any human-vs.-AI comparison?
Like, for example, consider AlphaGo vs. humans. It’s now well-established that AlphaGo is superhuman at Go. But it had the same “huge advantage” over humans that GPT does in the text prediction game! If AlphaGo had been hooked up to a robot body that had to image-recognize the board and then move pieces around, it too would have struggled, to put it mildly.
Here’s another way of putting it:
There’s this game, the “predict the next token” game. AIs are already vastly superhuman at this game, in exactly the same way that they are vastly superhuman at chess, go, etc. However, it’s possible though not by any means proven, that deep within the human brain there are subnetworks that are as good or better than AIs at predicting text, and those subnetworks just aren’t hooked up in the right way to the actual human behavior / motor outputs to get humans to be good at this game yet.
The only point of this token prediction game is as some sort of rough proxy to estimate human-brain token prediction ability. I think you may now agree it’s horrible at that, as it would require an unknown but significant amount of human training time to unlock the linguistic prediction ability the cortex already has. Humans’ poor zero-shot ability at this specific motor-visual game (which is the only thing this post tested!) does not imply that the human brain doesn’t have powerful token prediction ability (as I predict you already agree, or will shortly).
I highly doubt anybody is reading this and actually interested in the claim that AI is superhuman at this specific weird token-probability game and that alone. Are you? The key subtext here—all of the interest—is in the fact that this is a core generic proxy task, such that from merely learning token prediction a huge number of actually relevant downstream tasks emerge nearly automatically.
There is nothing surprising about this—we’ve known for a long, long time (since AIXI days) that purely learning to predict the sensory stream is in fact the only universal, necessary and sufficient learning task for superintelligence!
AlphaGo vs. humans at Go is very different in several key respects: firstly, (at least some) humans actually have non-trivial training (years) in the game itself, so we can more directly compare along more of the training curve. Secondly, Go is not a key component subtask of many economically relevant components of human intelligence in the way that token prediction is a core training proxy for, and subtask of, core linguistic abilities.
“However, it’s possible though not by any means proven, that deep within the human brain there are subnetworks that are as good or better than AIs at predicting text,”
Actually this is basically just a known fact from neuroscience and general AI knowledge at this point—about as proven as such things can be. Much of the brain (and nearly all sensorimotor cortex) learns through unsupervised sensory prediction. I haven’t even looked yet, but I’m also near-certain there are neuroscience papers that probe this for linguistic token prediction ability (as I’m reasonably familiar with the research on how the vision system works, and it is all essentially transformer-style unsupervised prediction of pixel streams, and it can’t possibly be different for the linguistic centers—as there are no hard-coded linguistic centers, there is just generic cortex).
So from this we already know a way to estimate human-equivalent perplexity—measure human ability on a battery of actually important linguistic tasks (writing, reading, math, etc.) and then train a predictor to estimate the equivalent perplexity of an LM with similar benchmark performance on those downstream tasks. The difficulty here, if anything, is that even the best LMs (last I checked) haven’t learned all of the emergent downstream tasks yet, so you’d have to bias the benchmark toward the tasks the LMs can currently handle.
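A rough sketch of that estimation procedure, assuming you already have benchmark scores and measured perplexities for a family of LMs at different scales (all numbers below are hypothetical placeholders):

```python
import numpy as np

# Hypothetical (avg benchmark score, perplexity) pairs for LMs of increasing
# scale, where the benchmark is a battery of writing/reading/math-style tasks.
lm_benchmark_score = np.array([0.35, 0.45, 0.55, 0.62, 0.70])
lm_log_perplexity = np.log(np.array([18.0, 14.0, 11.0, 9.5, 8.0]))

# Fit a simple predictor: benchmark score -> log perplexity.
slope, intercept = np.polyfit(lm_benchmark_score, lm_log_perplexity, 1)

# Plug in a measured human score on the same (LM-biased) benchmark battery
# to read off a "human-equivalent" perplexity.
human_benchmark_score = 0.85  # hypothetical
human_equiv_perplexity = np.exp(intercept + slope * human_benchmark_score)
print(f"human-equivalent perplexity ~ {human_equiv_perplexity:.1f}")
```

The extrapolation is only as good as the benchmark-to-perplexity relationship holding up beyond the range of existing LMs, which is the same caveat as biasing the benchmark toward tasks LMs can already do.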
Whoa, hold up. It’s one thing to say that the literature proves that the human brain is doing text prediction. It’s another thing entirely to say that it’s doing it better than GPT-3. What’s the argument for that claim, exactly? I don’t follow the reasoning you give above. It sounds like you are saying something like this:
“Both the brain and language models work the same way: Primarily they just predict stuff, but then as a result of that they develop downstream abilities like writing, answering questions, doing math, etc. So since the humans are better than GPT-3 at math etc., they must also be better than GPT-3 at predicting text. QED.”
Basically yes.
There are some unstated caveats however. Humans have roughly several orders of magnitude greater data efficiency on the downstream tasks, and part of that involves active sampling—we don’t have time to read the entire internet, but that doesn’t really matter because we can learn efficiently from a well chosen subset of that data. Current LMs just naively read and learn to predict everything, even if that is rather obviously sub-optimal. So humans aren’t training on exactly the same proxy task, but a (better) closely related proxy task.
How do you rule out the possibility that:
1. some aspects of language prediction are irrelevant for our lives / downstream tasks (e.g. different people would describe the same thing using subtly different word choice and order);
2. other aspects of language prediction are very important for our lives / downstream tasks (the gestalt of what the person is trying to communicate, the person’s mood, etc.);
3. an adult human brain is much better than GPT-3 at (2), but much worse than GPT-3 at (1);
4. the perplexity metric puts a lot of weight on (1) (see the toy sketch after this list);
5. and thus there are no circuits anywhere in the human brain that can outperform GPT-3 in perplexity?
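To make (4) concrete, here is a toy calculation with made-up probabilities, just to show how the arithmetic of perplexity can be dominated by essentially arbitrary word-choice tokens even when every gist-determined token is predicted nearly perfectly:

```python
import math

# Toy 6-token continuation. For each token: the probability a predictor that
# fully "gets the gist" but is indifferent among synonyms/orderings assigns
# to the exact token the writer happened to use. Numbers are made up.
p_exact_token = [
    0.95,  # gist-determined token, easy to predict
    0.95,
    0.25,  # one of ~4 interchangeable synonyms
    0.95,
    0.25,  # another essentially arbitrary word-choice token
    0.95,
]

total_nll = -sum(math.log(p) for p in p_exact_token)
avg_nll = total_nll / len(p_exact_token)
print(f"perplexity = {math.exp(avg_nll):.2f}")

# Fraction of the total log loss contributed by the two "arbitrary" tokens:
arbitrary_nll = -2 * math.log(0.25)
print(f"share of loss from arbitrary word choice: {arbitrary_nll / total_nll:.0%}")
```

With these made-up numbers, the two arbitrary-choice tokens account for over 90% of the total log loss, which is the sense in which the metric weights (1) heavily.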
That would be my expectation. I think human learning has mechanisms that make it sensitive to value-of-information, even at a low level.
If you have only tiny model capacity and abundant reward feedback, purely supervised learning wins—as in the first early successes in DL like AlexNet and DeepMind’s early agents. This is expected, because when each connection/param is super precious you can’t ‘waste’ any capacity by investing it in modeling bits of the world that don’t have immediate payoff.
But in the real world sensory info vastly dwarfs reward info, so with increasing model capacity unsupervised learning (UL) wins—as in the more modern success of transformers trained with UL. The brain is very far along in that direction—it has essentially unlimited model capacity in comparison.
Re 1/2: the issue is that the system can’t easily predict which aspects will turn out to be important much later for downstream tasks.
All that being said, I somewhat agree, in the sense that perplexity isn’t necessarily the best measure (the best measure being whichever best predicts performance on all the downstream tasks).
OK, cool. Well, I don’t buy that argument. There are other ways to do math besides being really really ridiculously good at internet text prediction. Humans are better at math than GPT-3 but probably that’s because they are doing it in a different way than merely as a side-effect of being good at text prediction.
If it were just math, then OK, sure. But GPT-3 and related LMs can learn a wide variety of linguistic skills at certain levels of compute/data scale, and I was explicitly referring to a wide (linguistic and related) skill benchmark, with math being a stand-in example for linguistic-related/adjacent skills.
And btw, from what I understand GPT-3 learns math from having math problems in its training corpus, so it’s not even a great example of a “side-effect of being good at text prediction”.
I’d be interested to see some test more favorable to the humans. Maybe humans are better at judging longer completions due to some kind of coherence between tokens, so a test could be:
Human attempts to distinguish between a 5-token GPT-3 continuation and the true continuation;
GPT-3 attempts to distinguish between a 5-token human continuation and the true continuation;
and whichever does better is better at language modeling? It still seems like GPT-3 would win this one, but maybe there are other ways to measure more human abilities.
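For the GPT-3 side of that test, “distinguishing” would presumably just mean checking which candidate continuation the model assigns higher log probability. A minimal sketch, where `token_logprobs` is a hypothetical stand-in for whatever scoring call your LM API actually exposes:

```python
from typing import List

def token_logprobs(prefix: str, continuation: str) -> List[float]:
    """Hypothetical wrapper: per-token log probabilities the LM assigns to
    `continuation` given `prefix`. Swap in a real LM scoring call here."""
    raise NotImplementedError

def lm_picks_truth(prefix: str, true_cont: str, human_cont: str) -> bool:
    """One trial of the 'LM vs. 5-token human continuation' direction:
    the LM 'chooses' whichever continuation it scores as more likely."""
    score_true = sum(token_logprobs(prefix, true_cont))
    score_fake = sum(token_logprobs(prefix, human_cont))
    return score_true > score_fake

# Accuracy of lm_picks_truth over many trials is then directly comparable to
# human accuracy at telling 5-token GPT-3 continuations from the truth.
```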
I suspect better human zero-shot performance on a more GAN-like objective over a more interesting dataset.
Take a new, unpublished (human-written) story, and for every sentence or paragraph the testee must pick the correct completion, multiple-choice style, with a single correct answer and 3-4 incorrect but plausible completions generated by an LLM.
I expect humans to do better at discernment over longer time windows. The advantage of this test setup is that it’s much closer to tasks humans already have some training on.
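A minimal sketch of how one trial of that multiple-choice setup could be assembled, with `generate_distractors` standing in for whatever LLM sampling call you would actually use (hypothetical, not a real API):

```python
import random
from typing import List, Tuple

def generate_distractors(story_so_far: str, n: int = 3) -> List[str]:
    """Hypothetical: sample n plausible-but-wrong next sentences/paragraphs
    from an LLM, conditioned on the story so far. Swap in a real call."""
    raise NotImplementedError

def build_trial(story_so_far: str, true_next: str) -> Tuple[List[str], int]:
    """Return shuffled options plus the index of the correct completion."""
    options = generate_distractors(story_so_far) + [true_next]
    random.shuffle(options)
    return options, options.index(true_next)

# Score the testee (human or model) by accuracy over every sentence/paragraph
# boundary of the unpublished story; longer windows test exactly the kind of
# coherence judgment humans are expected to be better at.
```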