However, creating the thing that people in 2010 would have recognized “as AGI” was accompanied, in a way people from the old 1900s “Artificial Intelligence” community would recognize, by a changing of the definition.
I interpret this as saying that
In 2010, there was (among people concerned with the topic) a commonly accepted definition for AGI,
that this definition was concrete enough to be testable,
and that current transformers meet this definition.
This doesn’t match my recollection, nor does it match my brief search of early sources. As far as I recall and could find, the definitions in use for AGI were something like:
Stating that there’s no commonly-agreed definition for AGI (e.g. “Of course, “general intelligence” does not mean exactly the same thing to all researchers. In fact it is not a fully well-defined term, and one of the issues raised in the papers contained here is how to define general intelligence in a way that provides maximally useful guidance to practical AI work.”—Ben Goertzel & Cassio Pennachin (2007) Artificial General Intelligence, p. V)
Saying something vague about “general problem-solving” that provides no clear criteria. (E.g. “But, nevertheless, there is a clear qualitative meaning to the term. What is meant by AGI is, loosely speaking, AI systems that possess a reasonable degree of self-understanding and autonomous self-control, and have the ability to solve a variety of complex problems in a variety of contexts, and to learn to solve new problems that they didn’t know about at the time of their creation.” Ibid, p. V-VI.)
Handwaving around Universal Intelligence: A Definition of Machine Intelligence (2007), which does not actually provide a real definition of AGI, and just proposes that we could use the ideas in it to construct an actual test of machine intelligence.
“Something is an AGI if it can do any job that a human can do.” E.g. Artificial General Intelligence: Concept, State of the Art, and Future Prospects (2014) states that “While we encourage research in defining such high-fidelity metrics for specific capabilities, we feel that at this stage of AGI development a pragmatic, high-level goal is the best we can agree upon. Nils Nilsson, one of the early leaders of the AI field, stated such a goal in the 2005 AI Magazine article Human-Level Artificial Intelligence? Be Serious! (Nilsson, 2005): I claim achieving real human-level artificial intelligence would necessarily imply that most of the tasks that humans perform for pay could be automated. Rather than work toward this goal of automation by building special-purpose systems, I argue for the development of general-purpose, educable systems that can learn and be taught to perform any of the thousands of jobs that humans can perform.”
That last definition—an AGI is a system that can learn to do any job that a human could do—is the one that I personally remember being the closest to a widely-used definition that was clear enough to be falsifiable. Needless to say, transformers haven’t met that definition yet!
Claude’s own outputs are critiqued by Claude and Claude’s critiques are folded back into Claude’s weights as training signal, so that Claude gets better based on Claude’s own thinking. That’s fucking Seed AI right there.
With plenty of human curation and manual engineering to make sure the outputs actually get better. “Seed AI” implies that the AI develops genuinely autonomously and without needing human involvement.
Half of humans are BELOW a score of 100 on these tests and as of two months ago (when that graph taken from here was generated) none of the tests the chart maker could find put the latest models below 100 iq anymore. GPT5 is smart.
We already had programs that “beat” half of humans on IQ tests as early as 2003; they were pretty simple programs optimized to do well on IQ tests, and they could do literally nothing other than solve the exact kinds of problems found on the tests they were designed for:
In 2003, a computer program performed quite well on standard human IQ tests (Sanghi & Dowe, 2003). This was an elementary program, far smaller than Watson or the successful chess-playing Deep Blue (Campbell, Hoane, & Hsu, 2002). The program had only about 960 lines of code in the programming language Perl (accompanied by a list of 25,143 words), but it even surpassed the average score (of 100) on some tests (Sanghi & Dowe, 2003, Table 1).
The computer program underlying this work was based on the realisation that most IQ test questions that the authors had seen until then tended to be of one of a small number of types or formats. Formats such as “insert missing letter/ number in middle or at end” and “insert suffix/prefix to complete two or more words” were included in the program. Other formats such as “complete matrix of numbers/characters”, “use directions, comparisons and/or pictures”, “find the odd man out”, “coding”, etc. were not included in the program— although they are discussed in Sanghi and Dowe (2003) along with their potential implementation. The IQ score given to the program for such questions not included in the computer program was the expected average from a random guess, although clearly the program would obtain a better “IQ” if efforts were made to implement any, some or all of these other formats.
So, apart from random guesses, the program obtains its score from being quite reliable at questions of the “insert missing letter/number in middle or at end” and “insert suffix/prefix to complete two or more words” natures. For the latter “insert suffix/prefix” sort of question, it must be confessed that the program was assisted by a look-up list of 25,143 words. Substantial parts of the program are spent on the former sort of question “insert missing letter/number in middle or at end”, with software to examine for arithmetic progressions (e.g., 7 10 13 16 ?), geometric progressions (e.g., 3 6 12 24 ?), arithmetic geometric progressions (e.g., 3 5 9 17 33 ?), squares, cubes, Fibonacci sequences (e.g., 0 1 1 2 3 5 8 13 ?) and even arithmetic-Fibonacci hybrids such as (0 1 3 6 11 19 ?). Much of the program is spent on parsing input and formatting output strings—and some of the program is internal redundant documentation and blank lines for ease of programmer readability. [...]
Of course, the system can be improved in many ways. It was just a 3rd year undergraduate student project, a quarter of a semester’s work. With the budget Deep Blue or Watson had, the program would likely excel in a very wide range of IQ tests. But this is not the point. The purpose of the experiment was not to show that the program was intelligent. Rather, the intention was showing that conventional IQ tests are not for machines—a point that the relative success of this simple program would seem to make emphatically. This is natural, however, since IQ tests have been specialised and refined for well over a century to work well for humans.
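The progression-detection core described above is simple enough to sketch. Here is a hypothetical reconstruction in Python (the original was Perl, and this is not the authors’ actual code), covering the arithmetic, geometric, Fibonacci, and hybrid formats the quote mentions:

```python
# Hypothetical Python sketch of the sequence-completion logic that
# Sanghi & Dowe (2003) describe -- NOT their actual (Perl) code.

def next_term(seq):
    """Guess the next term of a numeric IQ-test sequence."""
    diffs = [b - a for a, b in zip(seq, seq[1:])]
    # Arithmetic progression: constant difference, e.g. 7 10 13 16 -> 19
    if len(set(diffs)) == 1:
        return seq[-1] + diffs[0]
    # Geometric progression: constant ratio, e.g. 3 6 12 24 -> 48
    ratios = [b / a for a, b in zip(seq, seq[1:]) if a != 0]
    if len(ratios) == len(seq) - 1 and len(set(ratios)) == 1:
        return int(seq[-1] * ratios[0])
    # Fibonacci-style: each term is the sum of the previous two,
    # e.g. 0 1 1 2 3 5 8 13 -> 21
    if all(seq[i] == seq[i - 1] + seq[i - 2] for i in range(2, len(seq))):
        return seq[-1] + seq[-2]
    # Hybrids (arithmetic-geometric, arithmetic-Fibonacci): recurse on
    # the differences, e.g. 3 5 9 17 33 -> 65 and 0 1 3 6 11 19 -> 32
    return seq[-1] + next_term(diffs)

print(next_term([7, 10, 13, 16]))       # -> 19
print(next_term([0, 1, 3, 6, 11, 19]))  # -> 32
```

A few dozen rules like this plus a word list were evidently enough to “pass” the pattern-completion parts of many IQ tests, which is exactly the quoted point: the tests measure something in humans that these rules do not possess.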
IQ tests are built based on the empirical observation that some capabilities in humans happen to correlate with each other, so measuring one is also predictive of the others. For AI systems whose cognitive architecture is very unlike a human one, these traits do not correlate in the same way, making the reported scores of AIs on these tests meaningless.
Heck, even the scores of humans on an IQ test are often invalid if the humans are from a different population than the one the test was normed on. E.g. some IQ tests measure the size of your vocabulary, and this is a reasonable proxy for intelligence because smarter people will have an easier time figuring out the meaning of a word from its context, thus accumulating a larger vocabulary. But this ceases to be a valid proxy if you e.g. give that same test to people from a different country who have not been exposed to the same vocabulary, to people of a different age who haven’t had the same amount of time to be exposed to those words, or if the test is old enough that some of the words on it have ceased to be widely used.
Likewise, sure, an LLM that has been trained on the entire Internet could no doubt ace any vocabulary test… but that would say nothing about its general intelligence, just that it has been exposed to every word online and had an excessive amount of training to figure out their meaning. Nor does it getting any other subcomponents of the test right tell us anything in particular, other than “it happens to be good at this subcomponent of an IQ test, which in humans would correlate with more general intelligence but in an AI system may have very little correlation”.
I agree that some people were using “it is already smarter than almost literally every random person at things specialized people are good at (and it is too, except it is an omniexpert)” for “AGI”.
I wasn’t. That is what I would have called “weakly superhuman AGI” or “weak ASI” if I was speaking quickly.
I was using “AGI” to talk about something, like a human, who “can play chess AND can talk about playing chess AND can get bored of chess and change the topic AND can talk about cogito ergo sum AND <so on>”. Generality was the key. Fluid ability to reason across a vast range of topics and domains.
ALSO… I want to jump off into abstract theory land with you, if you don’t mind?? <3
Like… like psychometrically speaking, the facets of the construct that “iq tests” measure are usually suggested to be “fluid g” (roughly your GPU and RAM and working memory and the digit span you can recall and your reaction time and so on) and “crystal g” (roughly how many skills and ideas are usefully in your weights).
Right?
some IQ tests measure the size of your vocabulary, and this is a reasonable proxy for intelligence because smarter people will have an easier time figuring out the meaning of a word from its context, thus accumulating a larger vocabulary. But this ceases to be a valid proxy if you e.g. give that same test to people from a different country who have not been exposed to the same vocabulary, to people of a different age who haven’t had the same amount of time to be exposed to those words, or if the test is old enough that some of the words on it have ceased to be widely used.
Here you are using “crystal g from normal life” as a proxy for “fluid g” which you seem to “really care about”.
However, if we are interested in crystal g itself, then in your example older people (because they know more words) are simply smarter in this domain.
And this is a pragmatic measure, and mostly I’m concerned with pragmatics here, so that seems kinda valid?
But suppose we push on this some… suppose we want to go deep into the minutiae of memory and reason and “the things that are happening in our human heads in less than 300 milliseconds”… and then think about that in terms of machine equivalents?
Given their GPUs and the way they get eidetic memory practically for free, and the modern techniques to make “context windows” no longer a serious problem, I would say that digital people already have higher fluid g than us just in terms of looking at the mechanics of it? So fast! Such memory!
There might be something interesting here related to “measurement/processing resonance” in human vs LLM minds?
Like notice how LLMs don’t have eyes, or ears. And notice how they either have amazing working memory (because their exoself literally never forgets a single bit or byte that enters as digital input) or else terrible working memory (because their endoself’s sense data is maybe sorta simply “the entire context window their eidetic memory system presents to their weights”; if that is cut off, they simply don’t remember what they were just talking about, because their ONE sense is “memory in general”, and if the data isn’t interacting with the weights anymore then they don’t have senses OR memory, because for them these things are essentially fused at a very very low level).
It would maybe be interesting, from an academic perspective, for humans to engineer digital minds such that AGIs have more explicit sensory and memory distinctions internally, so we could explore the scientific concept of “working memory” with a new kind of sapient being whose “working memory” works in ways that are (1) scientifically interesting and (2) actually feasible to build.
Maybe something similar already exists internal to the various layers of activation in the various attentional heads of a transformer model? What if we fast forward to the measurement regime?! <3
Like right now I feel like it might be possible to invent puzzles or wordplay or questions or whatever where “working memory that has 6 chunks” flails for a long time, and “working memory that has 8 chunks” solves it?
We could call this task a “7 chunk working memory challenge”.
If we could get such a psychometric design working to test humans (who are in that range), then we could probably use algorithms to generalize it and create a “4 chunk working memory challenge” (to give to very very limited transformer models and/or human children to see if it even matters to them) and also a “16 chunk working memory challenge” (that essentially no humans would be able to handle in reasonable amounts of time, if the tests are working right). Then, by the end of the research project, we would see if it is possible to build a digital person with 16 slots of working memory… and then see what else they can do with all that headspace.
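One hypothetical way to make such challenges parameterized and machine-generatable (a toy sketch of my own, not an established psychometric instrument; all names and design choices here are assumptions): treat each “chunk” as a symbol-to-value binding that must be tracked through a sequence of swaps.

```python
import random

def make_challenge(n_chunks, n_swaps=10, seed=0):
    """Generate an n_chunks-chunk tracking puzzle and its answer.

    Hypothetical toy design: the solver must hold n_chunks bindings
    in mind and update them through n_swaps swap instructions.
    """
    rng = random.Random(seed)
    symbols = [chr(ord("A") + i) for i in range(n_chunks)]
    bindings = dict(zip(symbols, range(n_chunks)))
    steps = []
    for _ in range(n_swaps):
        a, b = rng.sample(symbols, 2)
        bindings[a], bindings[b] = bindings[b], bindings[a]
        steps.append(f"swap {a} and {b}")
    target = rng.choice(symbols)
    prompt = (
        "Initially " + ", ".join(f"{s}={i}" for i, s in enumerate(symbols))
        + ". Then " + ", then ".join(steps)
        + f". What value does {target} hold now?"
    )
    return prompt, bindings[target]

prompt, answer = make_challenge(n_chunks=4)
print(prompt)
```

Scaling `n_chunks` past ~7 should (if the working-memory story is right) make unaided humans fail sharply, while the very same generator can probe models or children at 4 chunks and hypothetical big minds at 16.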
Something I’m genuinely and deeply scientifically uncertain about is how and why working memory limits exist at all in “general minds”.
Like what if there was something that could subitize 517 objects as “exactly 517 objects” as a single “atomic” act of “Looking” that fluently and easily was woven into all aspects of mind where that number of objects and their interactions could be pragmatically relevant?
Is that even possible, from a computer science perspective?
Greg Egan is very smart, and in Diaspora (the first chapter of which is still online for free) he had one of the adoptive digital parents (I want to say it was Blanca? maybe in Chapter 2 or 3?) explain to Yatima, the young orphan protagonist program, that minds in citizens and in fleshers and in physical robots and in everyone all work a certain way for reasons related to math, and there’s no such thing as a supermind with 35 slots of working memory… but Egan didn’t get into the math of it in the text. It might have been something he suspected for good reasons (and he is VERY smart and might have reasons), or it might have been hand-waving world-building that he put into the world so that Yatima and Blanca and so on would be psychologically intelligible to the human readers, and only have as many working memory registers as us, making it a story that a human reader can enjoy because it has human-intelligible characters.
Assuming this limit is real, then here is the best short explanation I can offer for why such limits might be real: Some problems are NP-hard and need brute force. If you work on a problem like that with 5 elements, then 5-factorial is only 120, and the human mind can check it pretty fast. (Like: 120 cortical columns could all work on it in parallel for 3 seconds, and the answer could then arise in the conscious mind as a brute percept that summarizes that work?)
But if the same basic kind of problem has 15 elements, you need to check 15*14*13… and now it’s 1.3 trillion things to check? And we only have like 3 million cortical columns? And so like, maybe nothing can do that very fast if the “checking” involves performing thousands of “ways of thinking about the interaction of a pair of Generic Things”.
And if someone “accepts the challenge” and builds something with 15 slots, with enough “ways of thinking” about all the slots for them to count as working memory slots that an intelligence algorithm can use as the theatre of its mind… then doing it for 16 things is sixteen times harder than just 15 slots! …and so on… the scaling here would just be brutal...
So maybe a fluidly and fluently and fully general “human-like working memory with 17 slots for fully general concepts that can interact with each other in a conceptual way” simply can’t exist in practice in a materially instantiated mind, trapped in 3D, with thinking elements that can’t be smaller than atoms, with heat dissipation concerns like we deal with, and so on and so forth?
Or… rather… because reality is full of structure and redundancy and modularity maybe it would be a huge waste? Better to reason in terms of modular chunks, with scientific reductionism and divide and conquer and so on? Having 10 chunk thoughts at a rate 1716 times faster (==13*12*11) than you have a single 13 chunk thought might be economically better? Or not? I don’t know for sure. But I think maybe something in this area is a deep deep structural “cause of why minds have the shape that minds have”.
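For concreteness, the numbers invoked above (the 120, the ~1.3 trillion, the 1716) check out; the factorial wall is easy to see directly:

```python
from math import factorial

# The brute-force cost of checking all orderings of n fully general
# "slots" grows as n! -- the wall the argument above points at.
for n in (5, 10, 13, 15, 16, 17):
    print(f"{n:>2} slots -> {factorial(n):,} orderings")

# 13-chunk thoughts vs 10-chunk thoughts: 13*12*11 = 1716x the cost
print(factorial(13) // factorial(10))  # -> 1716
```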
Fluid g is mysterious.
Very controversial. Very hard to talk about with normies. A barren wasteland for scientists seeking prestige among democratic voters (who like to be praised and not challenged very much) who are (via delegation) offering grant funding to whomsoever seems like a good scientist to them.
And yet also, if “what is done when fluid g is high and active” was counted as “a skill”, then it is the skill with the highest skill transfer of any skill, most likely! Yum! So healthy and good. I want some!
If only we had more mad scientists, doing science in a way that wasn’t beholden to democratic grant giving systems <3
Unless you believe that humans are venal monsters in general? Maybe humans will instantly weaponize cool shit, and use it to win unjust wars that cause net harm but transfer wealth to the winners of the unjust war? Then… I guess maybe it would be nice to have FEWER mad scientists?? Like preferably zero of them on Earth? So there are fewer insane new weapons? And fewer wars? And more justice and happiness instead? Maybe instead of researching intelligence we should research wise justice instead?
As Critch says… safety isn’t safety without a social model.