The Problem With The Current State of AGI Definitions
The following includes a fictionalized account of a conversation I had with Professor Viliam Lisý at EAGx Prague, with most of the details just plain made up because I forgot how it actually went. Special thanks to Professor Dušan D. Nešić, whom I mistakenly thought I had this conversation with, and who ended up providing useful feedback after a very confused discussion on WhatsApp. Credit also goes to Justis from LessWrong, who kindly provided some excellent feedback prior to publication. Any seemingly bad arguments presented are due to my flawed retelling, and are not Dušan’s, Justis’, or Viliam’s.
“AGI has already been achieved. We did it. PaLM has achieved general intelligence, game over, you lose.”
“On the contrary, PaLM has achieved nothing of the sort. It is as far from general intelligence as a rock is from a baby.”
“You are correct, of course. I completely concede the point, for the purpose of this conversation. Regardless, this brings up a very important question: What would count as ‘general intelligence’ to you?”
“I’m not sure exactly what you’re asking.”
“What test could be performed which, if failed, would ensure (or at least make likely) that you were not dealing with an AGI, while if passed, would force you to say ‘yep, that’s an AGI all right’?”
Testing for a minimum viable AGI
The professor was quiet for a moment, deep in thought. Finally, he answered.
“If the AI can replace more than half of all jobs humans can currently do, then it is definitely an AGI—just as an average human can learn to do an average job after a finite training period, so should an Artificial General Intelligence.”
“Hmm. Your test is technically valid as an answer to my question, but it’s too exclusionary. What you are testing for is an AI with capabilities that would exceed those of any human being. There is not one individual on this earth, living or dead, who can do more than half of all jobs humans currently do, and certainly not one who can perform better than the average worker in that many fields. Your test would capture a superintelligent AGI just fine, but it would fail at identifying human-level general intelligence. In a way, this test reflects a common conflation of superintelligence with AGI, which is clearly a mistake if we wish to consider ourselves instances of ‘general intelligence’.”
We parted ways that night considering, without resolution, what a “minimum” AGI test would look like: one which would capture as many potential AGIs as possible without admitting false positives. We could not agree on, or even fully define, what testable properties an AGI must have at a minimum (or what a non-AGI can have at a maximum before you can no longer call its intelligence “narrow”). We also discussed how to kill everyone on Earth, but that’s a story for another day.
Why is the minimum viable AGI question worth asking?
When debating others, one of the most important steps in the discussion process is making sure that you understand the other person’s position. If you’ve misidentified where your disagreement lies, the debate won’t be productive.
One of the most important—and controversial—topics in AI safety is AGI timelines. When is AGI likely to arrive, and once it’s here, how long (if any time at all) will we have until it all goes FOOM? Resolving this question has important practical ramifications, both for the short and long term. If we can’t agree on what we mean when we say “AGI,” debating AGI timelines becomes a meaningless exchange of words, with neither side understanding the other. I’ve seen people argue that AGI will never exist, and that even if we can get an AI to do everything a human can do, that won’t be “true” general intelligence. I’ve seen people say that Gato is a general intelligence, and that we are living in a post-AGI world as I type this. Both of these people may make the exact same practical predictions about what the next few years will look like, but will give totally different answers when asked about AGI timelines! This leads to confusion and needless misunderstandings, which I think all parties would rather avoid.

As such, I would like to suggest a set of standardized, testable definitions for talking about AGI. We can have different levels of generality, but they must be clearly distinguishable, with different labels, to ensure we’re all on the same page here.
What follows is a list of suggestions for these different definitional levels of AGI. I invite discussion and criticism, and would like to eventually see a “canonical” list which we can all refer back to in future discussions. Ideally, this would ultimately be published in a journal or as a preprint by a trustworthy figure in AI research, to facilitate easy citation. If you are someone who would be able to help publish something like this, please reach out to me. Consider the following a very rough, incomplete draft of what such a list might look like.
Partial List of (Mostly Testable) AGI Definitions
“Nano AGI” — Qualifies if it can perform above random chance (at a statistically significant level) on a multiple-choice test found online that it was not explicitly trained on (see the code sketch after this list).
“Micro AGI” — Qualifies if it can reach either state-of-the-art (SOTA) or human-level performance on two or more AI benchmarks which have been mentioned in 10+ papers published in the past year, and which were not explicitly present in its training data.
“Yitzian AGI” — Qualifies if it can perform at the level of an average human or above on multiple (2+) tests which were originally designed for humans, and which were not explicitly present in its training data.
“OG Turing AGI” — Qualifies if it can “pass” as a woman in a chat room (with a non-expert tester) for ten minutes, with a success rate higher than a randomly selected cisgender American male.
“Weak Turing AGI” — Qualifies if it can pass a 10-minute text-based Turing test where the judges are randomly selected Americans.
“Standard Turing AGI” — Qualifies if it can reliably pass a Turing test of the type that would win the Loebner Silver Prize.
“Gold Turing AGI” — Qualifies if it can reliably pass a 2-hour Turing test of the type that would win the Loebner Gold Prize.
“Truck AGI” — Qualifies if it can successfully drive a truck from the East Coast to the West Coast of America.
“Book AGI” — Qualifies if it can write a 200+ page book (using a one-paragraph-or-less prompt) which makes it to the New York Times Bestseller list.
“Anthonion AGI” — Qualifies if it is A) Able to reliably pass a Turing test of the type that would win the Loebner Silver Prize, B) Able to score 90% or more on a robust version of the Winograd Schema Challenge (e.g. the “Winogrande” challenge or a comparable data set for which human performance is at 90+%), C) Able to score in the 75th percentile (as compared to the corresponding year’s human students) on the full mathematics section of a circa-2015–2020 standard SAT exam, using just images of the exam pages and having fewer than ten SAT exams as part of the training data, D) Able to learn the classic Atari game “Montezuma’s Revenge” (based on just visual inputs and standard controls) and explore all 24 rooms based on the equivalent of less than 100 hours of real-time play.
“Barnettian AGI” — Qualifies if it is A) Able to reliably pass a 2-hour, adversarial Turing test during which the participants can send text, images, and audio files during the course of their conversation, B) Has general robotic capabilities, of the type able to autonomously, when equipped with appropriate actuators and when given human-readable instructions, satisfactorily assemble a circa-2021 Ferrari 312 T4 1:8 scale automobile model, C) Able to achieve at least 75% accuracy on every task and a 90% mean accuracy across all tasks in the Q&A dataset developed by Dan Hendrycks et al., D) Able to get top-1 strict accuracy of at least 90.0% on interview-level problems found in the APPS benchmark introduced by Dan Hendrycks, Steven Basart et al.
“Lawyer AGI” — Qualifies if it can win a formal court case against a human lawyer, where it is not obvious how the case will resolve beforehand.
“Lisy-Dusanian AGI” — Qualifies if it can replace more than half of all jobs humans can currently do.
“Lisy-Dusanian+ AGI” — Qualifies if it can replace all jobs humans can currently do in a cost-effective manner.
“Hyperhuman AGI” — Qualifies if there is nothing any human can do (using a computer) that it cannot do.
“Impossible AGI” — never qualifies; no silicon-based intelligence will ever be truly general enough.
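A few of these definitions hinge on numerical thresholds, and it may help to show just how mechanically checkable they are. Below is a minimal sketch, in Python, of how one might score the “Nano AGI” criterion (via a one-sided binomial test against the chance rate) and the two-part Q&A threshold from the “Barnettian AGI” definition. To be clear, the function names, the significance level, and the example numbers are assumptions of my own for illustration; nothing here is part of any canonical benchmark.

```python
# A minimal sketch of two of the testable criteria above. The function
# names, the significance level, and the demo numbers are illustrative
# assumptions on my part, not part of any canonical benchmark.
from scipy.stats import binomtest

def passes_nano_agi(correct: int, questions: int, choices: int = 4,
                    alpha: float = 0.05) -> bool:
    """'Nano AGI': score above random chance, at a statistically
    significant level, on a multiple-choice test. Uses a one-sided
    binomial test against the chance rate of 1/choices."""
    result = binomtest(correct, questions, p=1 / choices, alternative="greater")
    return result.pvalue < alpha

def passes_barnettian_qa(task_accuracies: list[float]) -> bool:
    """Barnettian criterion C: at least 75% accuracy on every task AND
    a 90% mean accuracy across all tasks."""
    mean_acc = sum(task_accuracies) / len(task_accuracies)
    return min(task_accuracies) >= 0.75 and mean_acc >= 0.90

if __name__ == "__main__":
    # 35/100 on a four-choice test is significantly above the 25% chance rate.
    print(passes_nano_agi(correct=35, questions=100))  # True
    # 26/100 is above chance, but not distinguishably so on a 100-question test.
    print(passes_nano_agi(correct=26, questions=100))  # False
    # A single task at 70% fails the per-task floor despite the high mean.
    print(passes_barnettian_qa([0.95, 0.93, 0.70]))    # False
```

Note that the significance test is doing real work in the weakest definition: on a short test, a score barely above the chance rate does not qualify, which is exactly the sort of edge case a purely verbal definition leaves ambiguous.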
As for my personal opinion, I think that all of these definitions are far from perfect. If we set a definitional standard for AGI that we ourselves cannot meet, then such a definition is clearly too narrow. A plausible definition of “general intelligence” must include the vast majority of humans, unless you’re feeling incredibly solipsistic. And yet almost all of the above tests (with the exception of Turing’s) cannot be passed by the vast majority of humans alive! Clearly, our current tests are too exclusionary, and I would like to see an effort to create a “maximally inclusive test” for general intelligence which the majority of humans would be able to pass. Is Turing’s criterion as inclusive as we can get, or is it possible to improve on it further without also including clearly non-intelligent entities? I hope this post will encourage further thought on the matter, if nothing else.
If you want to add or change anything please let me know in the comments! I will strike out prior versions of names/descriptions if anything changes, so they can be referred back to.
Past work on AGI definitions exists, of course, often in the context of prediction markets and general AI benchmarks. I am not an expert in the field, and expect to have missed many important technical definitions. My wording may also be unacceptably imprecise at times. As such, I expect that I will ultimately need to partner with an expert to make a list which can be practically usable by formal researchers.
One designed for humans to answer, and picked more-or-less arbitrarily from the plethora of multiple-choice tests easily searchable on Google.
This is just to ensure that the benchmarks aren’t being created for the purpose of passing this test, but they can be older, “easy” benchmarks, as long as they’re still being actively cited in current literature.
This was originally “qualifies if it can beat random chance at multiple (2+) tasks,” a much weaker test, but it was pointed out to me that the definition could be interpreted in a whole bunch of contradictory ways, and is almost trivially weak. I’m also still not fully sure how to define “tasks.”
This is based on the original paper laying out the Turing test, which is actually quite interesting (and almost certainly queer-coded, imo!), and is worth an in-depth essay of its own.
Adapted from https://arxiv.org/abs/1705.08807
Taken almost word-for-word from Anthony’s excellent Metaculus question here:
Taken almost word-for-word from Matthew_Barnett’s excellent Metaculus question here:
An ‘adversarial’ Turing test is one in which the human judges are instructed to ask interesting and difficult questions, designed to advantage human participants, and to successfully unmask the computer as an impostor.
or the equivalent of a
Top-1 accuracy is distinguished, as in the paper, from top-k accuracy in which k outputs from the model are generated, and the best output is selected.
This was vaguely inspired by https://sd-marlow.medium.com/cant-finish-what-you-don-t-start-7532078952d2, in particular the line:
“understanding criminal law, or the history of the Roman Empire, is more of an application of AGI”
From Ray Kurzweil’s 1990 book The Age of Intelligent Machines.
With due respect to Kurzweil, I think his definition is rather flawed, to be honest (personal rant incoming). Name me a single human who can “successfully perform any intellectual task that a human being can.” Try to find even one person who succeeds at every task anyone can possibly do. Such a person does not exist. All humans are better in some areas and worse in others, if only because we do not have infinite time to learn every possible skillset (though to be fair, most other definitions on this list run into the same issue). See the closing paragraph of this post for more.