Hofstadter’s article is very non-specific about why their examples prove much. (Young children often produce completely nonsensical statements, and we don’t take that as evidence that they aren’t (going to be) generally intelligent.) But here’s an attempted concretization of Hofstadter’s argument which would render the counterargument in the post invalid (in short: because the post makes the task much easier for GPT by announcing that some of the questions will be nonsense).
I take it that when asked a question, GPT’s prior (i.e., credence before looking at the question) on “the question is nonsense” is very low, because nonsense questions are rare in the training data. If you have a low prior on “the question is nonsense”, you need relatively strong evidence to conclude that the question is nonsense. If you have an understanding of the subject matter (e.g., you know what it means to transport something across a bridge, how wide bridges are, what type of object a country is, how large it is, etc.), then it’s easy to overcome a very low prior. But if you don’t understand the subject matter, then it’s difficult to overcome the prior. For example, in a difficult exam on an unfamiliar topic, I might never come to assign >5% to a specific question being a nonsense/trick question. That’s probably why occasionally asking trick questions in exams works: students who understand can overcome the low prior, students who don’t understand cannot. Anyway, I take it that this is why Hofstadter concludes that GPT doesn’t understand the relevant concepts. If it did understand them, it would be able to overcome the low prior of being asked nonsense questions.
(Others in the comments here give some arguments against this, saying something to the effect that GPT isn’t trying at all to determine whether a given question is nonsense. I’m skeptical. GPT is trying to predict continuations of human text, so it is in particular trying to predict whether a human would respond to the question by saying that the question is nonsense.)
Now the OP asks GPT specifically whether a given question is nonsense or not. This, I assume, should cause the model to update toward a much higher credence on getting a nonsense question (before even looking at the question!). (Probably the prior without the announcement is something like 1/10000 (?) and the credence after the announcement is on the order of 1⁄3.) But getting from a 1⁄3 credence to a, say, 80% credence (and thus a “yo be real” response) is much easier than getting there from a 1/10000 credence. Therefore, it’s much less impressive that the system can detect nonsense when it already has a decently high credence in nonsense. In particular, it’s not that difficult to imagine getting from a 1⁄3 credence to an 80% credence with “cheap tricks”. (For example, if a sentence contains words from very different domains—say, “prime number” from number theory and “Obama” from politics—that is some evidence that the question is nonsense. Some individual word combinations are rare in sensible text, e.g., “transport Egypt” probably rarely occurs in sensible sentences.)
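To make the size of that gap concrete, here is a minimal sketch using the odds form of Bayes’ rule. It computes how strong the evidence (the likelihood ratio) has to be to move from a given prior to an 80% credence. The 1/10000 and 1⁄3 figures are just the rough guesses from above, and the function name is my own invention for illustration.

```python
def required_likelihood_ratio(prior: float, posterior: float) -> float:
    """Likelihood ratio needed to move from `prior` to `posterior`.

    Uses the odds form of Bayes' rule:
        posterior_odds = likelihood_ratio * prior_odds
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = posterior / (1 - posterior)
    return posterior_odds / prior_odds


# Guessed credences from the text above; not measured quantities.
lr_announced = required_likelihood_ratio(1 / 3, 0.8)
lr_unannounced = required_likelihood_ratio(1 / 10000, 0.8)

print(f"With announcement:    evidence must be ~{lr_announced:.0f}x more likely under 'nonsense'")
print(f"Without announcement: evidence must be ~{lr_unannounced:.0f}x more likely under 'nonsense'")
```

Under these assumed numbers, the announced case needs a likelihood ratio of about 8, while the unannounced case needs one of roughly 40,000. A few weak surface cues could plausibly add up to the former, but hardly to the latter.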
Here’s another way of putting this: Imagine you had to code a piece of software from scratch that has to guess whether a given question is nonsense or not. Now imagine two versions of the assignment: A) The test set consists almost exclusively of sensible questions. B) The test set has similarly many sensible and nonsense questions. (In both sets, the nonsense questions are as grammatical, etc. as the sensible questions.) You get paid for each question in the test set that you correctly classify. But then also in both cases someone, call her Alice, will look at what your system does for five randomly sampled sensible and five randomly sampled nonsense questions. She then writes an article about it in The Economist, or LessWrong, or the like. You don’t care about what Alice writes. In variant A, I think it’s pretty difficult to get a substantial fraction of the nonsense questions right without doing worse than always guessing “sensible”. In fact, I wouldn’t be surprised if, after spending months on this, I’d be unable to outperform always guessing “sensible”. In any case, I’d imagine that Alice will write a scathing article about my system in variant A. But in variant B, I’d immediately have lots of ideas that are all based on cheap tricks (like the ones mentioned at the end of the preceding paragraph). I could imagine that one could achieve reasonably high accuracy on this by implementing hundreds of cheap tricks and then aggregating them with linear regression or the like. So I’d also be much more hopeful about positive comments from Alice for the same amount of effort.
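A quick sketch of why the same cheap-trick classifier looks so different in the two variants. The detection and false-alarm rates below (70% and 10%) and the 1% nonsense rate for variant A are made-up numbers purely for illustration, not estimates of any real system.

```python
def accuracy(nonsense_rate: float, detect_rate: float, false_alarm_rate: float) -> float:
    """Overall accuracy of a classifier that flags questions as nonsense.

    detect_rate:      fraction of nonsense questions correctly flagged
    false_alarm_rate: fraction of sensible questions wrongly flagged
    """
    correct_on_nonsense = nonsense_rate * detect_rate
    correct_on_sensible = (1 - nonsense_rate) * (1 - false_alarm_rate)
    return correct_on_nonsense + correct_on_sensible


def always_sensible_accuracy(nonsense_rate: float) -> float:
    """Accuracy of the trivial baseline that always answers 'sensible'."""
    return 1 - nonsense_rate


# Hypothetical cheap-trick classifier (illustrative numbers only).
DETECT, FALSE_ALARM = 0.7, 0.1

for name, nonsense_rate in [("A (1% nonsense)", 0.01), ("B (50% nonsense)", 0.5)]:
    tricks = accuracy(nonsense_rate, DETECT, FALSE_ALARM)
    baseline = always_sensible_accuracy(nonsense_rate)
    print(f"Variant {name}: cheap tricks {tricks:.3f} vs. always-'sensible' {baseline:.3f}")
```

With these assumed rates, the cheap tricks lose to the trivial baseline in variant A (the false alarms on the many sensible questions cost more than the few detected nonsense questions gain), but clearly beat it in variant B. If you are paid per correct answer, in variant A you would rationally ship a system that almost never says “nonsense”, and that is the system Alice would then see fail on her five nonsense samples.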