Language models are nearly AGIs but we don’t notice it because we keep shifting the bar

I’m putting my existing work on AI on Less Wrong, editing as I go, in preparation for publishing a collection of my works on AI in a free online volume. If this content interests you, you could always follow my Substack; it’s free and also under the name Philosophy Bear.

Anyway, enjoy. Comments are appreciated, as I will be rewriting parts of the essays before I put them out. A big thank you to user TAG, who identified a major error in my previous post regarding the Chinese Room thought experiment, which prompted its correction [in the edition that will go in the book] and a new corrections section for my Substack page.

Glossary:

GPT-3: a text-generating language model.

PaLM-540B: a stunningly powerful question-answering language model.

Great Palm: a hypothetical language model that combines the powers of GPT-3 and PaLM-540B. Probably buildable with current technology, a lot of money, and a little elbow grease.

Great Palm with continuous learning (GPWCL): a hypothetical language model that combines the capacities of GPT-3 and PaLM-540B, with an important additional capacity. Most language models work over a “window” of text, which functions as short-term memory; their long-term memory is fixed by their training. Continuous learning is the capacity to keep adding to long-term memory as the model goes, and this would allow a language model to tackle much longer texts.
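To make the idea concrete, here is a minimal sketch of one way continuous learning might be approximated around a fixed-window model: an external long-term memory that the system writes to as it reads, and retrieves from to fill the short-term window. Everything below is an assumption for illustration, not a description of how any real system works; in particular, the bag-of-words similarity is a crude stand-in for the embeddings a real system would likely use.

```python
from collections import Counter

class ContinuousReader:
    """Toy wrapper giving a fixed-window model a persistent long-term memory."""

    def __init__(self, window_size=2048):
        self.window_size = window_size  # the model's fixed short-term "window"
        self.long_term = []             # memory that persists across inputs

    def _similarity(self, a, b):
        # Crude bag-of-words overlap; a real system would likely use embeddings.
        wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
        return sum((wa & wb).values())

    def remember(self, text):
        # Writing to long-term memory as it goes is exactly the capacity
        # this essay argues current language models lack.
        self.long_term.append(text)

    def build_prompt(self, query, k=3):
        # Retrieve the k most relevant remembered passages and pack them,
        # together with the query, into the fixed-size window.
        ranked = sorted(self.long_term,
                        key=lambda t: self._similarity(t, query),
                        reverse=True)
        context = "\n".join(ranked[:k])
        return (context + "\n" + query)[-self.window_size:]

reader = ContinuousReader()
for chapter in ["Chapter 1: Ishmael decides to go to sea.",
                "Chapter 2: He packs his bag for New Bedford."]:
    reader.remember(chapter)  # long texts accumulate piece by piece

print(reader.build_prompt("Who decides to go to sea?"))
```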

The argument

What I’ll be doing in this short essay is a bit cheeky, but I think it makes a few important points, viz.:

  1. Goals that seem very concrete can turn out to be vulnerable to bar-shifting, shifting which we may scarcely even notice.

  2. AGI is such a goal.

  3. We have gotten very good, much too good, at denying the progress we have made in AGI.

  4. A focus on able-bodied humanity, and the tendency to forget that disabled people exist when thinking about these topics, deceives us in these matters.

If I’m being a bit of a gadfly here, it’s not without a purpose.

Everything I say in this article arguably applies to GPT-3 alone, but for the avoidance of doubt, let me specify that I’m talking about a hypothetical language model that has the fluency of GPT-3 and the question-answering capabilities of PaLM-540B, which we will call the Great Palm to make it clear that we’re not taking ourselves too seriously. In my view, the Great Palm is very close to being an AGI.

I think the Great Palm lacks only one thing: the capacity for continuous learning, the capacity to remember the important bits of everything it reads, and not just during its training period. If the Great Palm (GPT-3 + PaLM-540B) had that ability, it would be an AGI.

“But hang on,” you say, “the Great Palm can’t draw, it can’t play computer games, it can’t listen to music, it can’t so much as tell an apple from a banana, and adding on a capacity for continuous learning doesn’t change that.”

I have two responses.

Response 1: Sure, but neither could the noted author, activist, and communist intellectual Helen Keller, nor can other completely deaf and blind people, all of whom are general intelligences.

Response 2: Actually, it may be able to do some of these things, so long as you can convert them into the modality of text. It’s quite conceivable that the Great Palm could analyze music, for example, if the notation were converted into text. We should focus more on content than on modality.
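To illustrate the point about modality, here is a minimal sketch. ABC notation is a real, purely textual encoding of music, so a melody can be handed to a text-only model as an ordinary prompt; the great_palm function below is hypothetical, a stand-in for whatever text-in, text-out interface such a model would expose.

```python
# ABC notation encodes music as plain text, so a text-only model can receive it.
# `great_palm` is a hypothetical stand-in, not a real model interface.

def great_palm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a hypothetical text-only model")

# "Twinkle Twinkle Little Star", written out in ABC notation.
melody_abc = """X:1
T:Twinkle Twinkle Little Star
M:4/4
K:C
C C G G | A A G2 | F F E E | D D C2 |"""

prompt = ("The following tune is written in ABC notation:\n\n"
          + melody_abc
          + "\n\nWhat key is it in, and does the opening phrase ascend or descend?")

# answer = great_palm(prompt)  # the modality is text; the content is music
```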

Why do I say that the Great Palm with a capacity for continuous learning would be an artificial general intelligence? Because it can attempt basically all the tasks that a human with access to a text-input, text-output console and nothing more could attempt, and make a reasonable go at them. In the case of the Great Palm with continuous learning, looking at what PaLM-540B and GPT-3 can do, it’s actually hard to find tasks at which the average human could beat it. (Look at the MMLU dataset if you don’t believe me: they’re tough questions.) That kind of broad scope is comparable to the scope of many humans.

To be clear, I am absolutely not saying that, for example, Helen Keller could only answer text-input, text-output problems. There are numerous other sensory modalities: touch, taste, and so on. Helen Keller could navigate a maze, whereas Great-Palm-With-Continuous-Learning could only do that if the maze were described to it. I suppose this gives a possible line of counterargument: we could disqualify Great-Palm-With-Continuous-Learning by adding a disjunction like “AGIs must be proficient in at least one of touch, taste, smell, sight or hearing”, but that seems arbitrary to me.

I’m not exactly going to proffer a definition of AGI here, but it seems to me that entities that can make a reasonable go at almost all text-input, text-output tasks count as AGIs. At the very least, imposing a requirement to be able to use particular sensory modalities is not only wrongly human-centric, it doesn’t even account for all human experience (e.g. the deaf and blind).

Objections:

What about commonsense reasoning: Maybe you’re worried about commonsense reasoning. Looking at PaLM’s capabilities, its performance on commonsense reasoning tasks is human-level, or very close to it. For example, PaLM-540B scored ~96% on the Winograd Schema test, which asks a model to resolve ambiguous pronouns (e.g. in “The trophy didn’t fit in the suitcase because it was too big”, what was too big?). My recollection is that most humans don’t score this high, but the authors set the bar at 100 because they reasoned that a human properly paying attention would get full marks [at least I seem to recall that’s why they changed it to 100 between GLUE and SuperGLUE]. Requiring 100% of human performance on commonsense reasoning tasks to be an AGI seems to me like special pleading. Near enough is good enough to count.

What about the Turing test: Would the Great Palm continuous learning edition be able to pass the Turing test reliably? I don’t know. I’m confident it could pass it sometimes, and I’m confident it could pass it more reliably than some humans, humans who are undoubtedly general intelligences. Language models have gotten very good at Turing tests, after all.

Surely there are some tasks it cannot do: Is it not possible that there might be some tasks that humans can do that Great Palm with continuous learning (GPWCL) can’t? I’d say it’s probable! Nonetheless, the great bulk of tasks an average literate human could do, GPWCL can do, and it’s quite difficult to find counterexamples. I think that insisting that AGI requires a computer to be able to perform literally every task a literate human can do is special pleading. If we encountered aliens, for example, it’s quite likely that there would be some tasks the average human could do that the average alien couldn’t (and vice versa); this wouldn’t exclude either species from counting as a general intelligence.

Haven’t you just arbitrarily drawn a line around text-input, text-output problems and said “being able to do the majority of these is enough for AGI”? Sure, definitions of AGI that exclude the deaf and the blind may be wrong, but that doesn’t prove text alone is sufficient. Maybe some third definition that includes Helen Keller but excludes Great-Palm-With-Continuous-Learning is right: Ultimately, this will come down to a definitional debate. However, when we focus on the content of problems rather than their modality, it becomes clear that the range of text-input, text-output tasks is vast; one might even say general.

What if there are other huge categories of text-input, text-output tasks that Great Palm with continuous learning could not attempt, which you are unaware of: Am I certain that continuous learning is the only thing holding something like the Great Palm back from the vast bulk of literate-human-accessible tasks? No, I’m not certain. I’m very open to counterexamples; if you have any, put them in the comments. Nonetheless, PaLM can do a lot of things, GPT-3 can do a lot of things, and when you put them together, the only things that stand out to me as obviously and qualitatively missing in the domain of text input and text output involve continuous learning.

Am I saying that text input, text output is the only way to prove intelligence? Absolutely not! The vast majority of humans who ever lived were illiterate. However, it seems general enough to me to qualify. It is sufficient, not necessary.

Aren’t you treating continuous learning as if it were a very easy problem, a negligible barrier, when in fact it’s very hard? That’s not my intention. I recognize that it is very hard. That said, at a guess, it is probably possible to build the Great Palm sans continuous learning now. Adding on the continuous learning component will take time, but I would be very surprised if it took anywhere near as much time as it took us to reach GPT-3 and PaLM-540B.

Implications

Turing proposed the Turing test as a test for something like AGI, but since then the concept of AGI seems to have metastasized somewhat. For example, Metaculus gives these as the requirements to qualify as a “weakly general” AGI:

  1. Able to reliably pass a Turing test of the type that would win the Loebner Silver Prize.

  2. Able to score 90% or more on a robust version of the Winograd Schema Challenge, e.g. the “Winogrande” challenge or comparable data set for which human performance is at 90+%

  3. Be able to score 75th percentile (as compared to the corresponding year’s human students; this was a score of 600 in 2016) on all the full mathematics section of a circa-2015-2020 standard SAT exam, using just images of the exam pages and having less than ten SAT exams as part of the training data. (Training on other corpuses of math problems is fair game as long as they are arguably distinct from SAT exams.)

  4. Be able to learn the classic Atari game “Montezuma’s Revenge” (based on just visual inputs and standard controls) and explore all 24 rooms based on the equivalent of less than 100 hours of real-time play.


And this as its definition of a strong AGI:

  1. Able to reliably pass a 2-hour, adversarial Turing test during which the participants can send text, images, and audio files (as is done in ordinary text messaging applications) during the course of their conversation. An ‘adversarial’ Turing test is one in which the human judges are instructed to ask interesting and difficult questions, designed to advantage human participants, and to successfully unmask the computer as an impostor. A single demonstration of an AI passing such a Turing test, or one that is sufficiently similar, will be sufficient for this condition, so long as the test is well-designed to the estimation of Metaculus Admins.

  2. Has general robotic capabilities, of the type able to autonomously, when equipped with appropriate actuators and when given human-readable instructions, satisfactorily assemble a (or the equivalent of a) circa-2021 Ferrari 312 T4 1:8 scale automobile model. A single demonstration of this ability, or a sufficiently similar demonstration, will be considered sufficient.

  3. High competency at a diverse range of fields of expertise, as measured by achieving at least 75% accuracy in every task and 90% mean accuracy across all tasks in the Q&A dataset developed by Dan Hendrycks et al.

  4. Able to get top-1 strict accuracy of at least 90.0% on interview-level problems found in the APPS benchmark introduced by Dan Hendrycks, Steven Basart et al. Top-1 accuracy is distinguished, as in the paper, from top-k accuracy in which k outputs from the model are generated, and the best output is selected.
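Since the last criterion turns on it, it is worth pinning down the difference between top-1 and top-k accuracy. Below is a hedged sketch of the distinction; generate and passes_all_tests are hypothetical stand-ins, not anything from the actual APPS evaluation code.

```python
# Sketch of the top-k vs. top-1 distinction in the APPS criterion.
# `generate` and `passes_all_tests` are hypothetical stand-ins.

def top_k_strict_accuracy(problems, generate, passes_all_tests, k):
    """Fraction of problems where any of k generated programs passes every test."""
    solved = 0
    for problem in problems:
        candidates = [generate(problem) for _ in range(k)]
        if any(passes_all_tests(problem, c) for c in candidates):
            solved += 1
    return solved / len(problems)

# Top-1 is the special case k == 1: the model gets a single attempt per problem,
# with no opportunity to pick the best of many samples, which is the harder bar.
```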

But to me, these aren’t really definitions of AGI. They’re definitions of an AGI that uses the visual, auditory, and kinaesthetic sensory modalities. Setting this as the bar for AGI effectively excludes some disabled people from being general intelligences, which is not desirable! That alone makes it worth correcting. But it also has another undesirable effect: adding these requirements onto the concept of intelligence is a form of bar-shifting that prevents us from recognizing our progress. This sort of bar-shifting is part of a general pattern of thought that means we keep being taken by surprise by our own achievements in machine learning.

Also, the second set of requirements particularly, but to a certain degree the first as well, is much too hard. Almost no human being would pass all of the second set. A solid majority would not pass the first. This also contributes to the bar-shifting problem. But that’s a matter for a different essay.

There’s an old joke in the field that intelligence is whatever it is that we can’t get computers to do at the moment. Let’s try to avoid that!