To simplify Daniel’s point: the pretraining paradigm claims that language draws heavily on important domains like logic, causal reasoning, world knowledge, etc; to reach human absolute performance (as measured in prediction: perplexity/cross-entropy/bpc), a language model must learn all of those domains roughly as well as humans do; GPT-3 obviously has not learned those important domains to a human level; therefore, if GPT-3 had the same absolute performance as humans but not the same important domains, the pretraining paradigm must be false because we’ve created a language model which succeeds at one but not the other. There may be a way to do pretraining right, but one turns out to not necessarily follow from the other and so you can’t just optimize for absolute performance and expect the rest of it to fall into place.
(It would have turned out that language models can model easier or inessential parts of human corpuses enough to make up for skipping the important domains; maybe if you memorize enough quotes or tropes or sayings, for example, you can predict really well while still failing completely at commonsense reasoning, and this would hold true no matter how much more data was added to the pile.)
As it happens, GPT-3 has not reached the same absolute performance because we’re just comparing apples & oranges. I was only talking about WebText in my comment there, but Omohundro is talking about Penn Tree Bank & 1BW. As far as I can tell, GPT-3 is still substantially short of human performance.
To simplify Daniel’s point: the pretraining paradigm claims that language draws heavily on important domains like logic, causal reasoning, world knowledge, etc; to reach human absolute performance (as measured in prediction: perplexity/cross-entropy/bpc), a language model must learn all of those domains roughly as well as humans do; GPT-3 obviously has not learned those important domains to a human level; therefore, if GPT-3 had the same absolute performance as humans but not the same important domains, the pretraining paradigm must be false because we’ve created a language model which succeeds at one but not the other. There may be a way to do pretraining right, but one turns out to not necessarily follow from the other and so you can’t just optimize for absolute performance and expect the rest of it to fall into place.
(It would have turned out that language models can model easier or inessential parts of human corpuses enough to make up for skipping the important domains; maybe if you memorize enough quotes or tropes or sayings, for example, you can predict really well while still failing completely at commonsense reasoning, and this would hold true no matter how much more data was added to the pile.)
As it happens, GPT-3 has not reached the same absolute performance because we’re just comparing apples & oranges. I was only talking about WebText in my comment there, but Omohundro is talking about Penn Tree Bank & 1BW. As far as I can tell, GPT-3 is still substantially short of human performance.