Often, qualitative differences turn out to be quantitative, especially in AI progress. As The Bitter Lesson pointed out in 2019, jumps in capabilities often don’t need some breakthrough or human ingenuity, but merely (much) more of the same, that is, scaling up the compute. And so we went from GPT-2, which could produce English text with mostly flawless grammar but not much more, to the multilingual GPT-3.5, which could write entire essays, to later models that are coming for most white-collar jobs.
This naturally raises the question of which other limitations in AI seem qualitative but end up being pretty much solved by the same thing but bigger. I wonder about three areas in particular:
Continual learning
Reliability & hallucinations
Multi-modality much closer to the human experience (something like audio-visual input with depth and time perception)
For all of these, it’s tempting to claim that they require some big breakthrough or an entirely different approach from LLMs, and that by default these current limitations will pose natural upper bounds on the impact of LLMs on our world. I can well imagine that certain breakthroughs could greatly accelerate progress in these areas. But I also can’t help but suspect that even without major breakthroughs, we’ll inevitably see serious progress on these fronts anyway.
Continual learning: context window sizes didn’t see the rapid progress of some other areas & benchmarks, but even so, today’s frontier models have ~10x the context window of 2023’s. It’s not the primary thing labs are optimizing, but it seems overwhelmingly likely to me that algorithmic + hardware progress will lead to larger context windows over the years. And if we do reach 10M or 100M token context windows eventually, I wouldn’t be surprised if that (combined with other capability improvements) were sufficient to make in-context learning capable enough to mostly alleviate the need for true continual learning for most economically valuable purposes. Sure, if somebody figures out truly scalable & robust continual learning, that’s an even bigger deal[1]. But I’d argue that even if this for whatever reason does not come to pass, merely scaling up context window sizes could eventually be sufficient to surpass the “context persistence advantages” of humans.[2]
Reliability & hallucinations: some people assume that LLMs will always hallucinate and that it will take a fundamentally different approach to overcome this. Maybe they’re right, but at least in agentic coding we see that if you get the feedback loops right and “tether the model” to some verifiable part of reality, hallucinations mostly become a non-issue. It’s unclear to me how far this will actually work & scale in other areas, and Sam Altman’s 2023 prediction that two years on “we won’t still talk about” hallucinations certainly turned out to be incorrect. But I wouldn’t be surprised if relatively marginal changes, such as forms of embodiment[3], best-of-n style answers, or whatever other surprisingly simple strategy gets identified in the meantime, end up increasing reliability greatly.
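To make the best-of-n idea concrete, here is a minimal sketch of sampling several candidate answers and keeping only one that passes an external check. The `flaky_model` and `checker` functions are toy stand-ins invented for illustration, not any real API; the point is just that an outside verifier can filter out unreliable outputs.

```python
import random

def best_of_n(prompt, generate, verify, n=5):
    """Sample n candidate answers and return the first that passes
    an external verification step; None if none verify."""
    candidates = [generate(prompt) for _ in range(n)]
    verified = [c for c in candidates if verify(c)]
    return verified[0] if verified else None

# Toy stand-ins: a "model" that is right only 60% of the time,
# and a verifier that checks the answer against ground truth.
def flaky_model(prompt):
    return "4" if random.random() < 0.6 else "5"

def checker(answer):
    return answer == "4"

random.seed(0)
print(best_of_n("What is 2 + 2?", flaky_model, checker, n=5))  # prints "4"
```

With a 60%-reliable generator, five independent samples already push the chance that at least one candidate verifies to ~99%; the scheme only works where a cheap, trustworthy `verify` step exists, which is exactly the “tethering to reality” condition above.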
Multi-Modality: in principle, a larger context window might allow simply providing an LLM with hundreds of images representing some form of livestream from a camera (or two), and appropriate training or reasoning might allow it to “perceive” movement. On the one hand, I’d think it’s a huge disadvantage for the LLM if the “time modality” is not properly represented in the way its inputs are tokenized[4]. On the other hand, it still seems conceivable that even such a suboptimal encoding of movement as “100 separate tokenized still images” could be handled by more advanced LLMs well enough to basically solve current limitations of LLM perception[5].
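A quick back-of-the-envelope calculation shows why context size is the binding constraint here. The per-frame token cost is an assumption (real image tokenizers vary widely); ~1,000 tokens per frame is just an illustrative round number.

```python
# Assumed, illustrative numbers -- not any specific model's tokenizer.
TOKENS_PER_FRAME = 1_000    # hypothetical cost of one tokenized still image
CONTEXT_WINDOW = 1_000_000  # a 1M-token context

def frames_token_cost(fps_sampled, seconds, tokens_per_frame=TOKENS_PER_FRAME):
    """Token cost of a 'livestream' sampled at fps_sampled for `seconds` seconds."""
    return fps_sampled * seconds * tokens_per_frame

# Sampling 2 frames/second for one minute costs ~120k tokens,
# i.e. roughly 12% of a 1M-token context under these assumptions.
cost = frames_token_cost(fps_sampled=2, seconds=60)
print(cost, cost / CONTEXT_WINDOW)  # prints 120000 0.12
```

Under these assumptions, a 1M-token window holds only a few minutes of sparsely sampled video, while a 100M-token window would hold hours, which is roughly the scale at which “just feed it the stream” starts to look viable.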
I’m not claiming that any of this is what is going to happen. Multi-modality in particular seems like something labs could expand a lot if it were a priority; they just happen to focus on other areas that are more lucrative on the current margin. Either way, the point of this post is just that these developments may serve as a bit of a lower bound on AI progress. Even if no major breakthroughs occur, I’d still assume we eventually end up
with in-context learning capable enough to surpass humans in many areas where we would currently assume continual learning to be required
with fewer and fewer hallucinations in many areas
and with AI models that can perceive the world in very similar ways to us, insofar as that’s helpful for the area they’re deployed in (and in many ways that may go way beyond the limits of human perception)
And to be fair, my best guess is that continual learning will see some breakthroughs in the next 1-3 years and will essentially get solved.
Somewhat related to this, I also get the impression that much of what’s currently happening in the AI coding landscape (around skills, MCPs, agents/claude.md files, memory, context management...) is to some degree “overfitting” on the current margin of AI capability and will become obsolete in future generations once LLMs become better at dynamically building & persisting meaningful context themselves. We’re in a fun phase where humans can still teach LLMs a lot to make them more useful, but I highly doubt this phase will last very long.
My thought here is that some form of embodiment “nails” the AI to reality and (directionally) prevents it from spiraling into strange failure modes; of course, it might still turn psychotic for various reasons, but a constant stream of “reality” would likely have some grounding influence compared to its current reality, which largely consists of its own thoughts, system prompts, and the ramblings of its conversation partner.
E.g., CNNs seem conceptually nice in that they encode a certain prior about the modality of images: neighboring pixels tend to be more relevant to each other than more distant pixels. Conversely, providing frames of a video as entirely separate images seems lacking, as the temporal connection isn’t really encoded, but just kind of “interpreted into it” after the fact.
To name one example of the limitations I mean here: if you’re working on a website and add some subtle animations to improve UX, today’s coding agents have a very hard time testing them. They can generally use browsers, click around, and look at different screenshots, but this usually happens “one screenshot at a time” and does not capture animations. They can still implement animations, and often do a good job of it, but they’re typically doing so blindly. Any human using the website, on the other hand, would instantly and automatically perceive the animations and notice when they’re off in any considerable way.
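As a toy illustration of how an agent harness could approximate this kind of temporal perception today, one could diff consecutive screenshots: if pixels change between frames captured in quick succession, something is animating. This is a sketch with made-up data, not a real browser-automation integration; real screenshots would come from a tool like a headless browser.

```python
import numpy as np

def motion_detected(frames, threshold=1.0):
    """Given a sequence of screenshots as equal-shaped arrays, flag
    whether consecutive frames differ enough (mean absolute pixel
    difference) to suggest an animation is playing."""
    diffs = [np.abs(a.astype(float) - b.astype(float)).mean()
             for a, b in zip(frames, frames[1:])]
    return any(d > threshold for d in diffs)

# Toy example: a static page vs. one where a pixel changes between frames.
static = [np.zeros((4, 4)) for _ in range(3)]
moving = [np.zeros((4, 4)) for _ in range(3)]
moving[1][0, 0] = 255  # one element changes in the middle frame

print(motion_detected(static))  # False
print(motion_detected(moving))  # True
```

Of course this only detects *that* something moves, not whether the animation is smooth or correct, which is exactly the gap between screenshot-level and human-level perception described above.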
None of this helps with automatically acquiring deep skills like playing good chess or fluency in a novel topic of math, and so these aren’t the straight lines on graphs directly relevant to crossing the AGI threshold, full automation of civilization.
Humans don’t know how to automate an AI’s learning of arbitrary deep skills that only come up post-deployment, but they can manually add such skills with RLVR at training time, by developing RL environments, graders, and tasks. AI might automate this process not by doing what humans couldn’t and inventing algorithmic advancements for low-level acquisition of deep skills, but merely by being smart and skilled enough to do all the same things that humans are currently doing to make it work “manually”. So in principle AI might become able to automatically acquire deep skills if it’s capable enough at routine AI R&D, even if it doesn’t have the capability to acquire deep skills at a low level the way humans do, and doesn’t have the capability to invent substantial algorithmic innovations that humans haven’t invented yet. Some of the straight lines on graphs are relevant to when this might happen, and so indirectly they are relevant to crossing the AGI threshold.
I don’t think in-context learning or even true continual learning with anything like the current methods can automate acquisition of deep skills at a low level, because only RLVR currently works for that purpose; context persistence is essentially unrelated. But these things might get AIs to the level of capability where they can do the same things as the humans who set up the ingredients for task-specific RLVR.
Even without longer contexts, LLMs’ ability to use notes effectively seems like the kind of skill issue that will likely improve over time, with or without algorithmic breakthroughs. A 1M-token context is already way more than a human can keep track of without notes.
True, it’s possible larger context windows aren’t even needed and 1M is sufficient for the majority of our economy to get automated.
I also think it’s easy to underestimate how much context humans actually gather over the years though. E.g. in my job there’s a huge amount of information I picked up over time. And I never fully know in advance what subset of that information I might need on any given day. It would be futile to even try to write down everything that I know, because much of that knowledge is latent, fuzzy, hard to put into words, or seems irrelevant but isn’t necessarily.
To list a few such things:
Company culture and structure
Teams and responsibilities
Many dozens of co-workers, their tenure, skills, personalities, common memories, what they look like, their voices
Dozens of tools, how to use and navigate them, when and why to use them, when and why they were introduced
A huge code base, or at least many many bits and pieces of it
The product(s) including future roadmap and past development, some known issues and limitations, their design
Context about how users interact with our software
Our competition and how we relate to them
I’d assume that my visual knowledge alone (what products, tools, people, logos, etc. look like) could fill a significant part of a 1M context window (given the current state of the tech).
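The rough arithmetic behind that estimate, assuming (hypothetically, as an illustrative round number) ~1,000 tokens per stored image:

```python
# Illustrative assumption, not any specific model's image tokenizer.
TOKENS_PER_IMAGE = 1_000
CONTEXT_WINDOW = 1_000_000

images_that_fit = CONTEXT_WINDOW // TOKENS_PER_IMAGE
print(images_that_fit)  # prints 1000
```

A thousand distinct visual memories (faces, tools, product screens, logos) is plausibly within what years on a job accumulate, so under these assumptions visual knowledge alone could indeed consume much of a 1M window.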
I recently tried to compile one really thorough readme for LLMs about one project I had worked on. I think it ended up at around 50k tokens, but it was very far from complete, as I have so much latent knowledge about it that I can’t just easily export on demand—it just lives somewhere in my brain, stashed away until some situation arises where I actually need it. That said, it’s possible that “the essence” of that knowledge could be compressed to, say, 10-20% of the token count, which would indeed make your argument very plausible.
I think this is a misunderstanding of the Bitter Lesson. The Bitter Lesson says that instead of a handcrafted ontology, you need a method that leverages large amounts of data and compute to discover the ontology. The transformer architecture is the human ingenuity here.