For reference, I’d also bet on 8+ hour task lengths (on METR’s benchmark[1]) by 2027. Probably significantly earlier; maybe early 2026, or even the end of this year. Would not be shocked if OpenAI’s IMO-winning model already clears that.
You say you expect progress to stall at 4-16 hours because solving such problems would require AIs to develop sophisticated models of them. My guess is that you’re drawing on intuitions about the task lengths at which that would become necessary for a human. LLMs, however, are not playing by the same rules: where we might need to build a new model, they may be able to retrieve a stored template solution. I don’t think we really have any idea at what task length this trick would stop working for them. I could see it being “1 week”, or “1 month”, or “>1 year”, or “never”.
I do expect “<1 month”, though. Or rather, that even if the LLM architecture is able to support arbitrarily big templates, the scaling of data and compute will run out before this point; and then plausibly the investment and the talent pools would dry up as well (after LLMs betray everyone’s hopes of AGI-completeness).
Not sure what happens if we do get to “>1 year”, because on my model, LLMs might still not become AGIs despite that. Like, they would still be “solvers of already solved problems”, except they’d be… able to solve… any problem in the convex hull of the problems any human ever solved in 1 year...? I don’t know, that would be very weird; but things have already gone in very weird ways, and this is what the straightforward extrapolation of my current models says. (We do potentially die there.[2])
Aside: On my model, LLMs are not on track to hit any walls. They will keep getting better at the things they’ve been getting better at, at the same pace, for as long as the inputs to the process (compute, data, data progress, algorithmic progress) keep scaling at the same rate. My expectation is instead that they’re just not going towards AGI, so “no walls in their way” doesn’t matter; and that they will run out of fuel before the cargo cult of them becomes Singularity-tier transformative.
(Obviously this model may be wrong. I’m still fluctuating around 80%.)
[1] Recall that it uses unrealistically “clean” tasks and accepts unviable-in-practice solutions: the corresponding horizons for real-world problem-solving seem much shorter. As do the plausibly-much-more-meaningful 80%-completion horizons, which currently sit at 26 minutes. (Something like 95%-completion horizons may actually be the most representative metric, though I assume there are some issues with estimating that.)
[2] Probably this way:
We should pause to note that a Clippy² still doesn’t really think or plan. It’s not really conscious. It is just an unfathomably vast pile of numbers produced by mindless optimization starting from a small seed program that could be written on a few pages. [...] When it ‘plans’, it would be more accurate to say it fake-plans; when it ‘learns’, it fake-learns; when it ‘thinks’, it is just interpolating between memorized data points in a high-dimensional space, and any interpretation of such fake-thoughts as real thoughts is highly misleading; when it takes ‘actions’, they are fake-actions optimizing a fake-learned fake-world, and are not real actions, any more than the people in a simulated rainstorm really get wet, rather than fake-wet. (The deaths, however, are real.)
Ok, but surely there has to be something they aren’t getting better at (or are getting better at too slowly). Under your model they have to hit a wall in this sense.
I think your main view is that LLMs won’t ever complete actually hard tasks, and that current benchmarks just aren’t measuring actually hard tasks or have other measurement issues? This seems inconsistent with saying they’ll just keep getting better, though, unless you’re hypothesizing truly insane benchmark flaws, right?
Like, if they stop improving at <1 month horizon lengths (as you say immediately above the text I quoted), that is clearly a case of LLMs hitting a wall, right? I agree that compute and resources running out could cause this, but it’s notable that we expect ~1 month horizons in not that long, like only ~3 years at the current rate.
That’s only if the faster within-RLVR rate that has been holding during the last few months persists. On my current model, 1 month task lengths at 50% happen in 2030-2032, since compute (being the scarce input of scaling) slows down compared to today, and I don’t particularly believe in incremental algorithmic progress as it’s usually quantified, so it won’t be coming to the rescue.
Compared to the post I did on this 4 months ago, I have even lower expectations that the 5 GW training systems (for individual AI companies) will arrive on trend in 2028; they’ll probably get delayed to 2029-2031. And I think the recent RLVR acceleration of the pre-RLVR trend only pushes it forward a year without making it faster: the changed “trend” of the last few months is merely RLVR chip-hours catching up to pretraining chip-hours, which is already essentially over. Though there are still no GB200 NVL72 sized frontier models and probably no pretraining-scale RLVR on GB200 NVL72s (which would get better compute utilization), so that might give the more recent “trend” another off-trend push first, perhaps as late as early 2026, but even then it’s not yet a whole year ahead of the old trend.
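To make the arithmetic behind these two estimates concrete, here is a minimal extrapolation sketch. The ~2-hour current 50%-horizon, the 167 work-hours per month, and the two candidate doubling times are illustrative assumptions, not figures taken from METR or from either commenter:

```python
# Illustrative horizon extrapolation, not METR's own methodology.
# Assumptions: ~2-hour 50%-completion horizon today; "1 month" = ~167 work-hours;
# two candidate doubling times (a faster recent rate vs. a slower older trend).
import math

current_horizon_hours = 2.0
target_horizon_hours = 167.0
doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)  # ~6.4

for doubling_time_months in (4, 7):
    months = doublings_needed * doubling_time_months
    print(f"{doubling_time_months}-month doubling time: "
          f"~{months / 12:.1f} years to a 1-month horizon")
```

Whether this comes out near “~3 years” or nearer “2030-2032” is mostly a function of which doubling time persists and whether compute scaling slows, which is the crux of the disagreement above.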
I distinguish “the LLM paradigm hitting a wall” and “the LLM paradigm running out of fuel for further scaling”.
Yes, precisely. Last I checked, we expected scaling to run out by 2029ish, no?
Ah, reading the comments, I see you expect there to be some inertia… Okay, 2032 / 7 more years would put us at “>1 year” task horizons. That does make me a bit more concerned. (Though 80% reliability is several doublings behind, and I expect tasks that involve real-world messiness to be even further behind.)
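For a rough sanity check of the “>1 year” figure (again with assumed inputs: a ~2-hour horizon today, a 7-month doubling time, and METR-style work-time units of 2,000 hours per year):

```python
# Rough check: what horizon does ~7 more years of doublings imply?
# Assumptions: ~2-hour 50%-horizon today, 7-month doubling time, 2000 work-hours/year.
horizon_hours = 2.0
doublings = 7 * 12 / 7            # ~12 doublings over 7 years
final_hours = horizon_hours * 2 ** doublings
print(f"~{final_hours:,.0f} hours ≈ {final_hours / 2000:.1f} work-years")  # ≈ 4 work-years
```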
“Ability to come up with scientific innovations” seems to be one.
Like, I expect they are getting better at the underlying skill. If you had a benchmark which measures some toy version of “produce scientific innovations” (AidanBench?), and you plotted frontier models’ performance on it against time, you would see the number going up. But it currently seems to lag way behind other capabilities, and I likewise don’t expect it to reach dangerous heights before scaling runs out.
The way I would put it, the things LLMs are strictly not improving on are not “specific types of external tasks”. What I think they’re not getting better at – because it’s something they’ve never been capable of doing – are specific cognitive algorithms which allow completing certain cognitive tasks in a dramatically more compute-efficient manner. We’ve talked about this some before.
I think that, in the limit of scaling, the LLM paradigm is equivalent to AGI, but that it’s not a very efficient way to approach this limit. And it’s less efficient along some dimensions of intelligence than along others.
This paradigm attempts to scale certain modules that a generally intelligent mind would have up to ridiculous levels of power, in order to make up for the lack of the other necessary modules. This will keep working to improve performance across all tasks, as long as you keep feeding LLMs more data and compute. But there seem to be only a few “GPT-4 to GPT-5” jumps left, and I don’t think it’d be enough.
I think if this were right, LLMs would already be useful for software engineering and able to make acceptable PRs.
I also guess that the level of agency you need to actually beat Pokémon is probably somewhere around 4 hours.
We’ll see who’s right—bet against me if you haven’t already! Though maybe it’s not a good deal anymore. I can see it going either way.
They are sometimes able to make acceptable PRs, usually when context gathering for the purpose of iteratively building up a model of the relevant code is not a required part of generating said PR.
It seems to me that current-state LLMs learn hardly anything from the context, since they have trouble fitting it into their attention span. For example, GPT-5 can create fun stuff from just one prompt, and an unpublished LLM solved five out of six problems of IMO 2025, while the six problems together can be expressed in about 3k bytes. However, METR found that “on 18 real tasks from two large open-source repositories, early-2025 AI agents often implement functionally correct code that cannot be easily used as-is, because of issues with test coverage, formatting/linting, or general code quality.”
I strongly suspect that this bottleneck will be ameliorated by using neuralese[1] with big internal memory.
Neuralese with big internal memory
The Meta paper which introduced neuralese had GPT-2 trained to feed the thought produced at the end back in at the beginning. Alas, the number of bits transferred is equal to the number of bits in a floating-point number multiplied by the width of the final layer. A token-based CoT, by comparison, generates only ~16.6 extra bits of information per token.
At the cost of an absolute loss of interpretability, neuralese on steroids could have an LLM of GPT-3’s scale transfer tens of millions of bits[2] in the latent space. Imagine GPT-3 175B (which had 96 layers and 12288 neurons in each) receiving an augmentation using the last layer’s results as a steering vector at the first layer, the second-to-last layer’s results as a steering vector at the second layer, etc. Or passing the steering vectors through a matrix. These augmentations at most double the compute required to run GPT-3, while requiring a few extra megabytes of dynamic memory.
For comparison, the human brain’s short-term memory alone is described by the activations of around 86 billion neurons. And that’s ignoring medium-term and long-term memory...
However, there is Knight Lee’s proposal where the AIs would generate multiple tokens instead of using versions of neuralese.
For comparison, the longest context window is 1M tokens long and is used by Google Gemini. 1M tokens are represented by ~16.6M bits.
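As a back-of-the-envelope check of the capacity figures in this footnote, here is a small sketch; the 16-bit activations and the GPT-3 dimensions above are the assumed inputs:

```python
# Capacity comparison sketch for the figures above (assumed 16-bit activations).
LAYERS = 96                 # GPT-3 175B layers
HIDDEN = 12288              # neurons per layer
BITS_PER_ACTIVATION = 16    # fp16/bf16 assumption
BITS_PER_TOKEN = 16.6       # ~log2 of a ~100K-token vocabulary

neuralese_bits = LAYERS * HIDDEN * BITS_PER_ACTIVATION   # all layers fed back
context_bits = 1_000_000 * BITS_PER_TOKEN                # 1M-token context window

print(f"full-depth neuralese state: ~{neuralese_bits / 1e6:.1f}M bits")  # ~18.9M
print(f"1M-token context window:    ~{context_bits / 1e6:.1f}M bits")    # ~16.6M
```

On these assumptions, the full-depth latent state and a 1M-token context window carry a comparable number of bits, both on the order of 10^7.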
People have been talking about neuralese since at least when AI 2027 was published and I think much earlier, but it doesn’t seem to have materialized.
I think LLMs can be useful for software engineering and can sometimes write acceptable PRs. (I’ve very clearly seen both of these first hand.) Maybe you meant something slightly weaker, like “AIs would be able to write acceptable PRs at a rate of >1/10 on large open source repos”? I think this is already probably true, at least with some scaffolding and inference time compute. Note that METR’s recent results were on 3.7 sonnet.
I’m referring to METR’s recent results. Can you point to any positive results on LLMs writing acceptable PRs? I’m sure that they can in some weak sense, e.g. on a sufficiently small project with sufficiently low standards, but as far as I remember the METR study concluded zero acceptable PRs in their context.
METR found that 0/4 of the PRs which passed test cases and which they reviewed were also acceptable to merge. This was for 3.7 Sonnet on large open source repos with default infrastructure.
The rate at which PRs passed test cases was also low, but if you’re focusing on the PR being viable to merge conditional on passing test cases, the “0/4” number is what you want. (And this is consistent with a true rate of 10%, or with some chance of 35%, of PRs being mergeable conditional on passing test cases; we don’t have a very large sample size here.)
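A quick check of that small-sample point; the candidate rates are just the ones mentioned above, treated as hypothetical true rates:

```python
# How surprising is an observed 0/4, under different true "mergeable | tests pass" rates?
for true_rate in (0.10, 0.35):
    p_zero_of_four = (1 - true_rate) ** 4
    print(f"true rate {true_rate:.0%}: P(0 of 4 acceptable) ≈ {p_zero_of_four:.0%}")
```

So 0/4 is entirely unsurprising at a 10% rate (about a 66% chance) and still happens roughly a fifth of the time at 35%, which is why four reviewed PRs don’t pin the rate down.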
I don’t think this is much evidence that AI can’t sometimes write acceptable PRs in general, and there are examples of AIs doing this. On small projects I’ve worked on, AIs from a long time ago have written a big chunk of code ~zero-shot. Anecdotally, I’ve heard of people having success with AIs completing tasks zero-shot. I don’t know what you mean by “PR” that doesn’t include this.
I think I already answered this: