After using Claude Code for a while, I can’t help but conclude that today’s frontier LLMs mostly meet the bar for what I’d consider AGI, with the exception of two things that, I think, explain most of their shortcomings:
- lack of real multimodality
- context window limitations
Most frontier models are marketed as multimodal, but in practice this often means text plus some way to encode images. And while LLM vision is good enough for many practical purposes, it’s far from perfect; and even if they had perfect sight, being limited to single, static images would still be a huge limitation[1].
Imagine you, with your human general intelligence, were sitting in a dark room, and were conversing with someone who has a complex, difficult problem to solve, and you do your best to help them. But you can only communicate through a mostly text-based interface that allows this person to send you occasional screenshots or photos. Further imagine that every hour or so you lose your entire memory & mental model of the problem, and find yourself with nothing but a high-level and very lossy summary of what has been discussed before.
I think it’s very likely that under such restrictive circumstances, it’s simply very hard not to run into all kinds of failure modes and capability limitations, even for the undoubtedly general intelligence that is you.
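The “lose your memory every hour and keep only a lossy summary” part of the analogy corresponds to how long-running agent sessions handle a finite context window: once the conversation exceeds the budget, the full history is replaced by a compressed summary. Here is a deliberately toy sketch of that mechanism. Everything in it (the word-count “token” budget, the first-sentence “summarizer”, the message texts) is a made-up simplification; real tools use an LLM-generated summary rather than truncation, but the information loss is of the same kind.

```python
# Toy illustration of context compaction: when the history exceeds a
# budget, it is replaced by lossy summaries and details are gone for good.

CONTEXT_BUDGET = 30  # pretend token budget, measured here in words


def summarize(history):
    # Stand-in for an LLM summarizer: keep only the first sentence of
    # each message. Deliberately lossy, like the "high-level and very
    # lossy summary" in the analogy above.
    return ["SUMMARY: " + m.split(".")[0] for m in history]


def add_message(history, msg):
    history.append(msg)
    # Approximate the token count by the total number of words.
    if sum(len(m.split()) for m in history) > CONTEXT_BUDGET:
        history = summarize(history)  # details beyond the budget vanish
    return history


history = []
for msg in [
    "The bug is in the parser. It only appears with nested quotes.",
    "We traced it to the tokenizer. The offset is off by one.",
    "Fix attempt one failed. The regression test still errors.",
]:
    history = add_message(history, msg)

print(history)
```

After the third message pushes the history over the budget, only the summaries survive; the crucial detail that the bug “only appears with nested quotes” is no longer anywhere in the context, which is exactly the failure mode the analogy points at.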
So, in some sense, I’d think there’s an “intelligence overhang”, where the raw intelligence that exists in these LLMs can’t fully unfold due to modality and context window limitations. These limitations are why Claude Code et al. don’t yet show the effects on the economy and the world as a whole that many would have expected from AGI. But I’d argue it makes sense to decouple the actual “intelligence” from the limited way in which it’s currently bound to interact with the world, even if, as some might correctly argue, modality and context window are inherent properties of LLMs. This distinction matters because it is, I suppose, neither part of most of the definitions people gave for AGI in the past, nor of the vague intuitions they had about what the term means.
[1] As opposed to, say, understanding video, including sound and a sense of time. (This is not to say that vision is necessary for general intelligence, of course; but that’s kind of my whole point: the general intelligence is already there, and the modality and context restrictions just mean AI is still much less effective at influencing the world than a “naively” imagined AGI would be.)
I think the jaggedness of RL (in modern LLMs) is an obstruction that needs to be addressed explicitly; it won’t fall to incremental improvements or scaffolding. There are two very different levels of capability, obtained in pretraining and in RLVR (RL with verifiable rewards), but only pretraining is somewhat general. And even pretraining doesn’t adapt to novel situations except through in-context learning, which only expresses capabilities at the level of pretraining, significantly weaker than RLVR-trained narrow capabilities.
Scaling will make pretraining stronger, but probably not enough to matter for this issue, and natural text data will only last for another step of improvement comparable to what happened in 2023-2025 (in pretraining alone, ignoring RLVR). If RL doesn’t get more general, it’ll probably remain useless for improving general capabilities outside the skills explicitly trained with RLVR. Capabilities will remain jagged, with gaps that have to be addressed manually by changing the training data.
This could change within a few years, possibly even faster if LLMs can be RLVRed into being able to RLVR themselves, though that won’t necessarily work. Or it could change via next-token-prediction RLVR that makes pretraining stronger without requiring more natural text data; but this probably needs much more compute even if it works in principle, so it might also take 5-10 years, with uncertain results in terms of capability level.
> So, in some sense, I’d think that there’s an “intelligence overhang”, where the raw intelligence that exists in these LLMs can’t fully unfold due to modality & context window limitations.
Another missing piece is research taste, or curiosity: the sort you would need to come up with ideas for new papers.
Does “multimodality” include features like having a physical world model, such that it could issue sensible commands to a robot body, for instance?