Aside: On my model, LLMs are not on track to hit any walls. They will keep getting better at the things they’ve been getting better at, at the same pace, for as long as the inputs to the process (compute, data, algorithmic progress) keep scaling at the same rate. My expectation is instead that they’re just not going towards AGI, so “no walls in their way” doesn’t matter; and that they will run out of fuel before the cargo cult of them becomes Singularity-tier transformative.
Ok, but surely there has to be something they aren’t getting better at (or are getting better at too slowly). Under your model they have to hit a wall in this sense.
I think your main view is that LLMs won’t ever complete actually hard tasks, and that current benchmarks just aren’t measuring actually hard tasks or have other measurement issues? This seems inconsistent with saying they’ll just keep getting better, though, unless you’re hypothesizing truly insane benchmark flaws, right?
Like, if they stop improving at <1 month horizon lengths (as you say immediately above the text I quoted), that is clearly a case of LLMs hitting a wall, right? I agree that compute and resources running out could cause this, but it’s notable that we expect ~1 month in not that long, like only ~3 years at the current rate.
it’s notable that we expect ~1 month in not that long, like only ~3 years at the current rate
That’s only if the faster within-RLVR rate that has been holding during the last few months persists. On my current model, 1 month task lengths at 50% happen in 2030-2032, since compute (being the scarce input of scaling) slows down compared to today, and I don’t particularly believe in incremental algorithmic progress as it’s usually quantified, so it won’t be coming to the rescue.
Compared to the post I did on this 4 months ago, I have even lower expectations that the 5 GW training systems (for individual AI companies) will arrive on trend in 2028; they’ll probably get delayed to 2029-2031. And I think the recent RLVR acceleration of the pre-RLVR trend only pushes it forward a year without making it faster: the changed “trend” of the last few months is merely RLVR chip-hours catching up to pretraining chip-hours, a catch-up that is already essentially over. Though there are still no GB200 NVL72 sized frontier models, and probably no pretraining-scale RLVR on GB200 NVL72s (which would get better compute utilization), so that might give the more recent “trend” another off-trend push first, perhaps as late as early 2026, but even then it would not yet be a whole year ahead of the old trend.
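For concreteness, here is a minimal back-of-the-envelope sketch of the horizon-length extrapolation we’re disagreeing about. The specific numbers (a ~2-hour current 50%-reliability horizon, a ~4-month doubling time for the faster recent trend, a ~7-month doubling time for the older pre-RLVR trend) are illustrative assumptions of mine, not figures either of us is committed to:

```python
import math

# Illustrative assumptions, not claims from the discussion above:
CURRENT_HORIZON_HOURS = 2.0    # ~2h tasks at 50% reliability today
WORK_HOURS_PER_MONTH = 167.0   # "1 month" of human task time
FAST_DOUBLING_MONTHS = 4.0     # the faster within-RLVR-era rate
SLOW_DOUBLING_MONTHS = 7.0     # the older pre-RLVR rate

def years_to_reach(target_hours, doubling_months, start_hours=CURRENT_HORIZON_HOURS):
    """Years until the 50%-horizon reaches target_hours at a constant doubling time."""
    doublings = math.log2(target_hours / start_hours)
    return doublings * doubling_months / 12.0

for label, d in [("fast, ~4 mo/doubling", FAST_DOUBLING_MONTHS),
                 ("slow, ~7 mo/doubling", SLOW_DOUBLING_MONTHS)]:
    years = years_to_reach(WORK_HOURS_PER_MONTH, d)
    print(f"{label}: ~{years:.1f} years to a 1-month horizon (~{2025 + years:.0f})")
```

Under these assumptions the faster rate reaches a 1-month horizon in roughly 2 years and the slower in roughly 4; the 2030-2032 estimate above additionally assumes the doubling time stretches out as compute scaling slows.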
Like, if they stop improving at <1 month horizon lengths (as you say immediately above the text I quoted), that is clearly a case of LLMs hitting a wall, right?
I distinguish “the LLM paradigm hitting a wall” and “the LLM paradigm running out of fuel for further scaling”.
I agree that compute and resources running out could cause this, but it’s notable that we expect ~1 month in not that long, like only ~3 years at the current rate.
Yes, precisely. Last I checked, we expected scaling to run out by 2029ish, no?
Ah, reading the comments, I see you expect there to be some inertia… Okay, 2032 / 7 more years would put us at “>1 year” task horizons. That does make me a bit more concerned. (Though 80% reliability is several doublings behind, and I expect tasks that involve real-world messiness to be even further behind.)
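As a sanity check on that extrapolation, here’s a small sketch under the same illustrative assumptions as the snippet above (a ~2-hour starting horizon, a ~7-month doubling time); the “80% reliability lags by a few doublings” factor is likewise a placeholder of mine, not a measured value:

```python
START_HOURS = 2.0              # assumed current 50%-reliability horizon
DOUBLING_MONTHS = 7.0          # assumed slower, pre-RLVR-style doubling time
YEARS_AHEAD = 7                # "2032 / 7 more years"
RELIABILITY_LAG_DOUBLINGS = 3  # placeholder: how far the 80% horizon trails the 50% one
WORK_HOURS_PER_YEAR = 2000     # ~1 year of human task time

doublings = YEARS_AHEAD * 12 / DOUBLING_MONTHS
horizon_50 = START_HOURS * 2 ** doublings
horizon_80 = START_HOURS * 2 ** (doublings - RELIABILITY_LAG_DOUBLINGS)

print(f"50% horizon after {YEARS_AHEAD} years: ~{horizon_50:.0f}h "
      f"(~{horizon_50 / WORK_HOURS_PER_YEAR:.1f} work-years)")
print(f"80% horizon ({RELIABILITY_LAG_DOUBLINGS} doublings behind): ~{horizon_80:.0f}h")
```

On these numbers the 50% horizon clears a work-year comfortably while the 80% horizon does not, which is roughly the shape of the caveat in the parenthetical.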
Ok, but surely there has to be something they aren’t getting better at (or are getting better at too slowly)
“Ability to come up with scientific innovations” seems to be one.
Like, I expect they are getting better at the underlying skill. If you had a benchmark which measures some toy version of “produce scientific innovations” (AidanBench?), and you plotted frontier models’ performance on it against time, you would see the number going up. But it currently seems to lag way behind other capabilities, and I likewise don’t expect it to reach dangerous heights before scaling runs out.
The way I would put it, the things LLMs are strictly not improving on are not “specific types of external tasks”. What I think they’re not getting better at – because it’s something they’ve never been capable of doing – are specific cognitive algorithms which allow them to complete certain cognitive tasks in a dramatically more compute-efficient manner. We’ve talked about this some before.
I think that, in the limit of scaling, the LLM paradigm is equivalent to AGI, but that it’s not a very efficient way to approach this limit. And it’s less efficient along some dimensions of intelligence than along others.
This paradigm takes certain modules that a generally intelligent mind would have and scales them to ridiculous levels of power, in order to make up for the lack of other necessary modules. This will keep working to improve performance across all tasks, as long as you keep feeding LLMs more data and compute. But there seem to be only a few “GPT-4 to GPT-5” jumps left, and I don’t think it’d be enough.