How do you rule out the possibility that:
1. some aspects of language prediction are irrelevant for our lives / downstream tasks (e.g. different people would describe the same thing using subtly different word choice and order);
2. other aspects of language prediction are very important for our lives / downstream tasks (the gestalt of what the person is trying to communicate, the person's mood, etc.);
3. an adult human brain is much better than GPT-3 at (2), but much worse than GPT-3 at (1);
4. the perplexity metric puts a lot of weight on (1);
5. and thus there are no circuits anywhere in the human brain that can outperform GPT-3 in perplexity?
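To make the perplexity point concrete, here is a minimal sketch (with made-up numbers) of how perplexity is computed: every token's log-probability counts equally, so tokens reflecting arbitrary word choice weigh exactly as much as the tokens carrying the gist.

```python
import math

# Hypothetical per-token probabilities a model assigns to an observed sentence.
# Gist-carrying tokens and interchangeable stylistic tokens contribute to the
# score in exactly the same way.
token_probs = [0.4, 0.05, 0.3, 0.6, 0.02, 0.25]  # made-up values

# Cross-entropy: average negative log-probability per token (in nats).
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of the cross-entropy.
perplexity = math.exp(cross_entropy)
print(f"cross-entropy: {cross_entropy:.3f} nats/token, perplexity: {perplexity:.1f}")
```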
That would be my expectation. I think human learning has mechanisms that make it sensitive to value-of-information, even at a low level.
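One way to picture that sensitivity (a hypothetical sketch, not a claim about actual brain mechanisms): weight each token's prediction error by an estimated value-of-information, instead of weighting everything uniformly as perplexity does.

```python
import math

# Same made-up per-token probabilities as above, plus a hypothetical
# "value of information" weight per token (gist-carrying tokens near 1.0,
# interchangeable word-choice tokens near 0.1). All values are illustrative.
token_probs = [0.4, 0.05, 0.3, 0.6, 0.02, 0.25]
voi_weights = [0.1, 1.0, 0.2, 0.1, 1.0, 0.9]

# Uniform loss (what perplexity measures) vs. a value-weighted loss.
uniform_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
weighted_loss = (-sum(w * math.log(p) for w, p in zip(voi_weights, token_probs))
                 / sum(voi_weights))

print(f"uniform: {uniform_loss:.3f}, value-weighted: {weighted_loss:.3f}")
```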
If you have only tiny model capacity and abundant reward feedback, purely supervised learning wins, as in the first early successes in DL like AlexNet and DeepMind's early agents. This is expected: when each connection/parameter is super precious, you can't 'waste' any capacity by investing it in modeling bits of the world that don't have immediate payoff.
But in the real world sensory info vastly dwarfs reward info, so with increasing model capacity unsupervised learning (UL) wins, as in the more recent successes of transformers trained with UL. The brain is very far along in that direction; it has essentially unlimited model capacity by comparison.
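A rough back-of-the-envelope illustration of "sensory info vastly dwarfs reward info" (all figures are assumptions for the sake of the comparison): a scalar reward supplies at most a few bits per step, while predicting the next sensory observation supplies orders of magnitude more.

```python
# Back-of-the-envelope comparison; every number here is an illustrative assumption.
steps_per_second = 10

# Reward channel: a scalar reward distinguishable into ~16 levels per step.
reward_bits_per_step = 4                    # log2(16)

# Sensory channel: a 64x64 grayscale observation with ~1 bit/pixel of
# genuinely unpredictable content after compression.
sensory_bits_per_step = 64 * 64 * 1

print("reward  bits/s:", reward_bits_per_step * steps_per_second)    # ~40
print("sensory bits/s:", sensory_bits_per_step * steps_per_second)   # ~40,960
print("ratio:", sensory_bits_per_step / reward_bits_per_step)        # ~1000x
```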
Re (1) and (2): the issue is that the system can't easily predict which aspects will turn out to matter much later for downstream tasks.
All that being said, I somewhat agree, in the sense that perplexity isn't necessarily the best measure (the best measure being whichever one best predicts performance on all the downstream tasks).
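If one wanted to test that empirically, a minimal sketch (with invented numbers, assuming Python 3.10+ for `statistics.correlation`) would be to check how well perplexity ranks a set of models against their averaged downstream-task scores.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Invented numbers: perplexity and mean downstream accuracy for four models.
perplexities = [25.0, 20.0, 18.0, 15.0]
downstream_acc = [0.55, 0.61, 0.60, 0.70]

# Lower perplexity should track higher downstream accuracy; a strongly
# negative correlation would support perplexity as a useful proxy.
print("correlation:", correlation(perplexities, downstream_acc))
```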