Great post. I think the central claim is plausible, and would very much like to find out I’m in a world where AGI is decades away instead of years. We might be ready by then.
If I am reading this correctly, there are two specific tests you mention:
1) GPT-5 level models come out on schedule (as @Julian Bradshaw noted, we are still well within the expected timeframe based on trends to this point)
2) LLMs or agents built on LLMs do something “important” in some field of science, math, or writing
I would add, on test 2, that almost no individual human has done something “important” in those fields either. We don’t have a clear explanation for why some humans have much more of this capability than others, even though all human brains run on similar hardware and software. This suggests the number of additional insights needed to go from “can’t do novel important things” to “can do” may be as small as zero, though I don’t think it is actually zero. In any case, I am hesitant to embrace a test for AGI that a large majority of humans fail.
In practical terms, suppose this summer OpenAI releases GPT-5-o4, and by winter it’s the lead author on a theoretical physics or pure math paper (or at least the main contributor—legal considerations about personhood and IP might stop people from calling AI the author). How would that affect your thinking?
I think the central claim is plausible, and would very much like to find out I’m in a world where AGI is decades away instead of years. We might be ready by then.
Me too!
If I am reading this correctly, there are two specific tests you mention:
1) GPT-5 level models come out on schedule (as @Julian Bradshaw noted, we are still well within the expected timeframe based on trends to this point)
See my response to his comment: I don’t think it’s so clear that projecting those trends invalidates my model. It really depends on whether GPT-5 is actually a qualitative upgrade comparable to the previous steps, which we do not know yet.
2) LLMs or agents built on LLMs do something “important” in some field of science, math, or writing
I would add, on test 2, that almost no individual human has done something “important” in those fields either. We don’t have a clear explanation for why some humans have much more of this capability than others, even though all human brains run on similar hardware and software. This suggests the number of additional insights needed to go from “can’t do novel important things” to “can do” may be as small as zero, though I don’t think it is actually zero. In any case, I am hesitant to embrace a test for AGI that a large majority of humans fail.
This seems about right, but there are two points to keep in mind.
a) It is all the more surprising that LLMs can’t do anything important because their knowledge far surpasses any human’s, which indicates that some cognitive function is qualitatively missing.
b) I think that roughly the bottom 30% (a very rough estimate) of humans in developed nations are essentially un-agentic. The kind of major discoveries and creations I pointed to mostly come from the top 1%. However, in the middle of that range there are still plenty of people capable of knowledge work. I don’t see LLMs managing the sort of project that would take a mediocre mid-level employee a week or a month. So there’s a gap here, even between LLMs and ordinary humans. I am not as certain about this as I am about the stronger test, but it lines up with my experience with DeepResearch: I asked it for a literature review of my field, and it had pretty serious problems that would have made it unusable, despite the task requiring ~no knowledge creation (I can email you an annotated copy if you’re interested).
In practical terms, suppose this summer OpenAI releases GPT-5-o4, and by winter it’s the lead author on a theoretical physics or pure math paper (or at least the main contributor—legal considerations about personhood and IP might stop people from calling AI the author). How would that affect your thinking?
Assuming the results of the paper are true (everyone would check) and at least somewhat novel/interesting (~sufficient for the journal to be credible), this would completely change my mind. As I said, it is a crux.
My own understanding is that, other than maybe for writing code, no one has actually given LLMs the kind of training a talented human gets on the way to becoming the kind of person capable of performing novel and useful intellectual work. An LLM has a lot of knowledge, but knowledge isn’t what makes useful and novel intellectual work achievable. A non-reasoning model gives you the equivalent of a top-of-mind answer. A reasoning model with a large context window and chain of thought can do better and solve more complex problems, but still mostly those within the reach of a new hire fresh out of college or grad school.
I genuinely don’t know whether an LLM with proper training could do novel intellectual work at current capability levels. To find out in a way I’d find convincing, someone would have to give it the hundreds of thousands of dollars and the subjective years’ worth of guidance, feedback, and iteration that humans get. And really, you’d have to do this at least hundreds of times, across different fields and with different pedagogical methods, to demonstrate even a weakly satisfactory “no,” because 1) most humans empirically fail at this, and 2) those who succeed don’t all do so in the same field or by the same path.
Fair enough, thanks.