“Both the brain and language models work the same way: Primarily they just predict stuff, but then as a result of that they develop downstream abilities like writing, answering questions, doing math, etc. So since the humans are better than GPT-3 at math etc., they must also be better than GPT-3 at predicting text. QED.”
Basically yes.
There are some unstated caveats, however. Humans have several orders of magnitude greater data efficiency on the downstream tasks, and part of that comes from active sampling: we don't have time to read the entire internet, but that doesn't really matter, because we can learn efficiently from a well-chosen subset of that data. Current LMs just naively read and learn to predict everything, even though that is rather obviously sub-optimal. So humans aren't training on exactly the same proxy task, but on a (better) closely related proxy task.
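To make the "active sampling vs. read everything" contrast concrete, here is a minimal toy sketch; the corpus, the `estimated_value_of_information` scoring rule, and the budget are all invented for illustration, not a claim about how either humans or current LMs actually select data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each "document" is just a vector; its usefulness is proxied by
# how novel it is relative to what the learner has already absorbed.
corpus = rng.normal(size=(10_000, 32))

def estimated_value_of_information(doc, model_state):
    # Hypothetical scoring rule: novelty = distance from what the learner knows.
    return np.linalg.norm(doc - model_state)

model_state = np.zeros(32)   # stand-in for "what the learner already knows"
budget = 500                 # we can't read the whole internet

# Naive LM-style training: consume documents in corpus order up to the budget.
naive_subset = corpus[:budget]

# Active sampling: score everything cheaply, then spend the same budget
# only on the highest-value documents.
scores = np.array([estimated_value_of_information(d, model_state) for d in corpus])
top_idx = np.argsort(scores)[-budget:]
active_subset = corpus[top_idx]

print("mean novelty, naive subset:  ", scores[:budget].mean())
print("mean novelty, active subset: ", scores[top_idx].mean())
```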
How do you rule out the possibility that:
1. some aspects of language prediction are irrelevant for our lives / downstream tasks (e.g., different people would describe the same thing using subtly different word choice and order);
2. other aspects of language prediction are very important for our lives / downstream tasks (the gestalt of what the person is trying to communicate, the person's mood, etc.);
3. an adult human brain is much better than GPT-3 at (2), but much worse than GPT-3 at (1);
4. the perplexity metric puts a lot of weight on (1) (see the sketch after this list);
5. and thus there are no circuits anywhere in the human brain that can outperform GPT-3 in perplexity.
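For reference, here is a minimal sketch of how perplexity is computed, assuming we already have the probability the model assigned to each actual next token. The numbers are made up purely for illustration, but they show why every token's exact identity, including the near-interchangeable word choices in (1), feeds directly into the metric.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability per token).

    token_probs: probability the model assigned to each actual next token.
    """
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Made-up numbers: a model that nails the gist but spreads probability over
# several equally good phrasings (e.g. 'big' vs 'large') still pays for every
# such stylistic token.
gist_tokens  = [0.9, 0.8, 0.85]        # semantically constrained tokens
style_tokens = [0.25, 0.3, 0.2, 0.25]  # interchangeable word-choice tokens

print(perplexity(gist_tokens + style_tokens))
```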
That would be my expectation. I think human learning has mechanisms that make it sensitive to value-of-information, even at a low level.
If you have only tiny model capacity and abundant reward feedback, purely supervised learning wins, as in the early successes in DL like AlexNet and DeepMind's early agents. This is expected: when each connection/parameter is super precious, you can't 'waste' any capacity by investing it in modeling bits of the world that don't have an immediate payoff.
But in the real world sensory info vastly dwarfs reward info, so with increasing model capacity UL (unsupervised learning) wins, as in the more modern success of transformers trained with UL. The brain is very far along in that direction: it has essentially unlimited model capacity by comparison.
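A back-of-envelope illustration of the "sensory info vastly dwarfs reward info" point; the bit rates below are rough order-of-magnitude guesses chosen purely for illustration, not measurements.

```python
# All numbers are illustrative assumptions, not measurements.
sensory_bits_per_sec = 1e6   # assumed order of magnitude for raw sensory input
reward_bits_per_sec  = 1.0   # assumed: a sparse scalar reward signal

seconds_per_year = 3600 * 24 * 365
years = 10

sensory_total = sensory_bits_per_sec * seconds_per_year * years
reward_total  = reward_bits_per_sec * seconds_per_year * years

print(f"sensory signal over {years} years: ~{sensory_total:.1e} bits")
print(f"reward  signal over {years} years: ~{reward_total:.1e} bits")
print(f"ratio: ~{sensory_total / reward_total:.0e}x")
```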
On (1) and (2): the issue is that the system can't easily predict which aspects will turn out to be important much later for downstream tasks.
All that being said, I somewhat agree, in the sense that perplexity isn't necessarily the best measure (the best measure being whichever one best predicts performance on all the downstream tasks).
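One way to cash out "the measure that best predicts downstream performance" would be to check how well perplexity rank-orders models against an average over downstream tasks. The sketch below uses invented checkpoint numbers purely to show the shape of that comparison.

```python
from scipy.stats import spearmanr

# Hypothetical evaluation records for a handful of model checkpoints:
# validation perplexity and average accuracy on a basket of downstream tasks.
# Numbers are invented for illustration.
perplexity_by_model = [35.0, 28.0, 22.0, 18.0, 15.0]
downstream_accuracy = [0.41, 0.48, 0.47, 0.55, 0.61]

# If perplexity were a good proxy, lower perplexity should rank models roughly
# the same way downstream accuracy does (a strong negative correlation here).
rho, pval = spearmanr(perplexity_by_model, downstream_accuracy)
print(f"Spearman correlation between perplexity and downstream accuracy: {rho:.2f}")
```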
OK, cool. Well, I don't buy that argument. There are other ways to do math besides being really really ridiculously good at internet text prediction. Humans are better at math than GPT-3, but that's probably because they do it in a different way, not merely as a side-effect of being good at text prediction.
If it were just math, then OK, sure. But GPT-3 and related LMs can learn a wide variety of linguistic skills at certain levels of compute/data scale, and I was explicitly referring to a wide (linguistic and related) skill benchmark, with math being a stand-in example for linguistic-related/adjacent skills.
And btw, from what I understand, GPT-3 learns math from having math problems in its training corpus, so it's not even a great example of a "side-effect of being good at text prediction".