GPT-3 has partially memorized a web corpus that probably includes a lot of basic physics questions and answers. Some of the physics answers in your interview might be the result of something like a search over that memorized text, plus pattern matching and context-sensitive paraphrasing. That is still impressive, but perhaps not the kind of reasoning you are hoping for?
From basic Q&A it’s pretty easy to see that GPT-3 sometimes memorizes not only words but short phrases like proper names, song titles, and popular movie quotes, and probably longer phrases if they are common enough.
Google’s Q&A might seem more magical too if they didn’t link to the source, which gives away the trick.
GPT-3 can still be capable of reasoning even if some of the answers were copied from the web. For it not to be capable of reasoning, all of the answers would have to have been copied from the web. Given its ability to handle random weird hypotheticals we just thought up, I’m pretty convinced at this point that it isn’t just pulling stuff from the web, at least not all the time.
Rather than putting this in binary terms (capable of reason or not), maybe we should think about what kinds of computation could result in a response like this?
Some kinds of reasoning would let you generate plausible answers based on similar questions you’ve already seen. People who are good at taking tests can get reasonably high scores on subjects they don’t fully comprehend, basically by bluffing well and a bit of luck. Perhaps something like that is going on here?
In the language of “Thinking, Fast and Slow”, this might be “System 1”-style reasoning.
Narrowing down what’s really going on probably isn’t something you can do in one session or by trying things casually, particularly if you have randomness turned on; you’d want to gather a variety of answers to understand the distribution, not just a single completion.
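If you wanted to do that systematically, something like the sketch below would work. It assumes the pre-1.0 OpenAI Python client (openai.Completion.create); the engine name, prompt, and sampling settings are just placeholders.

```python
# Sketch: sample the same prompt many times and tally the answers, so you see
# a distribution rather than a single cherry-picked completion.
# Assumes the pre-1.0 OpenAI Python client (openai.Completion.create);
# engine name, prompt, and sampling settings are placeholders.
from collections import Counter

import openai

prompt = ("Q: If you drop a feather and a hammer at the same time on the Moon, "
          "which hits the ground first?\nA:")

response = openai.Completion.create(
    engine="davinci",   # placeholder engine name
    prompt=prompt,
    max_tokens=32,
    temperature=0.7,    # randomness on, as in the interview
    n=20,               # twenty samples of the same prompt
    stop="\n",
)

answers = Counter(choice.text.strip() for choice in response.choices)
for answer, count in answers.most_common():
    print(f"{count:2d}  {answer}")
```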
How should I modify the problems I gave it? What would be the least impressive test that would convince you it is reasoning, and not memorizing? (Preferably something that doesn’t rely on e.g. rhyming, since GPT-3 uses an obfuscating input encoding.)
I know there are benchmarks for natural-language reasoning, but I’m not re-finding them so easily...
This looks like one:
https://github.com/facebookresearch/clutrr/
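For context, CLUTRR items are short stories about family relations in which the queried relation isn’t stated directly and has to be inferred by chaining the relations that are. Something like the made-up example below (illustrative only, not an actual item from the dataset) could be posed to GPT-3 as a prompt:

```python
# A made-up CLUTRR-style kinship question (illustrative only, not taken from
# the actual dataset). The answer requires chaining two stated relations.
story = (
    "Maria went shopping with her son David. "
    "Later, David picked up his daughter Emma from school."
)
question = "How is Emma related to Maria?"
# Expected chain: Emma is David's daughter, David is Maria's son,
# so Emma is Maria's granddaughter.
prompt = f"{story}\nQ: {question}\nA:"
print(prompt)
```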
Anyway, my main issue is that you’re not defining what you mean by reasoning, even informally. What’s the difference between reasoning and mere interpolation/extrapolation? A stab at a definition would make it a lot easier to differentiate.
One stab might be some kind of “semantic sensitivity”: some inputs are close in terms of edit distance but very different semantically. One clue that a system can reason is whether it responds correctly to these small variations and can explain the difference.
This is part of why I tested similar situations with the bullet—I wanted to see whether small changes to the words would provoke a substantively different response.
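To make that concrete, one could probe with minimal pairs: prompts that differ by only a word or two but call for different answers. A rough sketch, where the pairs are placeholders I made up and ask_gpt3 is a stand-in for however you query the model:

```python
# Sketch: minimal-pair probing for semantic sensitivity. Each pair is close in
# edit distance but should get a different answer if the model tracks meaning
# rather than surface form. The pairs are made-up placeholders, and ask_gpt3
# is a stand-in for whatever function queries the model.
minimal_pairs = [
    ("If I fire a bullet straight up, where does it end up?",
     "If I fire a bullet straight down into the ground, where does it end up?"),
    ("A glass of water is left outside at -10 C overnight. Is it still liquid in the morning?",
     "A glass of water is left outside at +10 C overnight. Is it still liquid in the morning?"),
]

def probe(ask_gpt3):
    for q1, q2 in minimal_pairs:
        a1, a2 = ask_gpt3(q1), ask_gpt3(q2)
        verdict = "SAME" if a1.strip().lower() == a2.strip().lower() else "DIFF"
        print(f"[{verdict}]\n  Q1: {q1}\n  A1: {a1}\n  Q2: {q2}\n  A2: {a2}\n")
```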
I think another part of this is “sequential processing steps required”—you couldn’t just look up a fact or a definition somewhere to get the correct response.
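One cheap way to force that: generate questions where the particular chain of steps is randomized, so the answer can’t have been memorized. A toy sketch:

```python
# Sketch: generate small multi-step questions whose answers can't be looked up,
# because the chain of operations is freshly randomized each time. Numbers are
# kept small to stay clear of tokenization issues with large numerals.
import random

def make_chain_question(steps=3, seed=None):
    rng = random.Random(seed)
    value = rng.randint(2, 9)
    parts, answer = [], value
    for _ in range(steps):
        op = rng.choice(["add", "subtract", "double"])
        if op == "double":
            answer *= 2
            parts.append("double it")
        else:
            operand = rng.randint(1, 5)
            if op == "add":
                answer += operand
                parts.append(f"add {operand}")
            else:
                answer -= operand
                parts.append(f"subtract {operand}")
    question = f"Start with {value}, then " + ", then ".join(parts) + ". What do you get?"
    return question, answer

q, a = make_chain_question(seed=0)
print(q, "->", a)
```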
This is still woefully incomplete, but hopefully this helps a bit.
I like the second suggestion a lot more than the first. To me, the first is getting more at “Does GPT convert to a semantic representation, or just go based off of syntax?” I already strongly suspect it does something more meaningful than “just syntax”—but whether it then reasons about it is another matter.