The issue I’m raising is similar to how you seem to be taking METR time horizon metrics (in their real-world form, as opposed to the impractical idealized form that amounts to a theoretical definition of having surpassed humans) as meaningful indicators about takeoff timelines. There are lots of proxy measures that I think don’t mean anything qualitatively (about time to actual full automation of civilization) even over some orders of magnitude in capability metrics. You might acknowledge they are in some ways causally uncoupled from what drives the real takeoff milestones, but you still seem to be relying on them heavily when forecasting timelines.
So I’m getting the same impression from the way you are framing 0% code reviewed (in contrast to 0% code written!): it just doesn’t seem very relevant, in a way your writing doesn’t take a clear stance on (even as you gesture at measurable advance proxies being conceptually distinct from causal takeoff milestones). I think quantitative advances mostly matter only to the extent they increase pre-AGI TAM for AI companies, which increases the research and training compute available to them, which compresses the time in which researchers get to explore more of the algorithmic low-hanging fruit (in particular, seeing more clearly what the more promising ideas actually amount to when scaled with serious resources).
I acknowledge that METR time horizon has loads of limitations; my position has just been that it’s the least bad benchmark to extrapolate / the best single piece of evidence we have. Do you have a better suggestion / alternative?
Similarly, re: 0% code reviewed: it seems about as relevant to me as the 0% code written milestone. Would you agree? Do you have other milestones to point to which you think are more relevant than either?
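(For concreteness, a minimal sketch in Python of the kind of extrapolation being referred to here. The ~2-hour current horizon, ~7-month doubling time, and reference date are illustrative round numbers in the rough ballpark of METR’s published results, not their exact fit; the disagreement in this thread is precisely about whether such numbers track anything causally relevant.)

```python
from datetime import date

# Illustrative assumptions (round numbers, not METR's exact fit):
CURRENT_HORIZON_HOURS = 2.0    # assumed ~50%-success time horizon today
DOUBLING_TIME_MONTHS = 7.0     # assumed doubling time of that horizon
REFERENCE_DATE = date(2025, 6, 1)

def extrapolated_horizon_hours(target: date) -> float:
    """Naive exponential extrapolation of the 50%-success time horizon."""
    months_ahead = (target.year - REFERENCE_DATE.year) * 12 + (target.month - REFERENCE_DATE.month)
    return CURRENT_HORIZON_HOURS * 2 ** (months_ahead / DOUBLING_TIME_MONTHS)

for year in (2026, 2027, 2028, 2030):
    h = extrapolated_horizon_hours(date(year, 6, 1))
    print(f"{year}: ~{h:.0f} hours (~{h / 8:.1f} eight-hour work-days)")
```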
I think benchmark extrapolation primarily helps with figuring out how far you can go for a given architecture/method and level of compute, and with estimating what level of TAM that makes available to build more compute. Benchmarks respond both to scaling and to incremental algorithmic improvements, so there’s a feedback loop for any given toolset of methods. But at some point the current architecture/method can be written off as insufficient to fully automate civilization until a new breakthrough sufficiently changes something, which takes an amount of time that’s probably not predictable from benchmark extrapolation for the previous methods.
There’s uncertainty about whether any given architecture/method is sufficient to fully automate civilization, and it gets gradually resolved as the method is scaled closer to the limits of available compute and as a few years pass with a given level of compute to explore incremental algorithmic improvements. Benchmarks are always extremely deceptive about whether you’ll get all the way to fully automated civilization, no matter what the benchmark is ostensibly saying and what level of performance you reach. If an architecture/method is insufficient, then it’s insufficient at any level of benchmark performance, but in that case credence that it’s insufficient should still manage to go up before this becomes clear. Then you’re back to needing the next research breakthrough that’s not just a low-hanging-fruit incremental improvement (for the current methods and level of compute), which benchmark extrapolation won’t be any help in predicting.
So… no? You don’t have any other milestones or benchmarks to point to that you think are better?
Separately, I take your point that “at some point the current architecture/method can be written off as insufficient… until a new breakthrough … which takes an amount of time that’s probably not predictable...” except that actually I think it IS predictable, if we think that the current paradigm will massively accelerate AI R&D in general: “Shortly after AI R&D is massively accelerated, the next paradigm (or three) will be discovered” would be my prediction.
Also, I don’t think there’s that much more in the “new breakthrough” category needed. Like, maybe continual learning? And that’s about it? Not even sure we need it tbh.
So… no? You don’t have any other milestones or benchmarks to point to that you think are better?
I don’t agree with the methodology of using benchmarks in the way that you do (as I tried to explain), so looking for better benchmarks in this role would be beside the point.
Also, I don’t think there’s that much more in the “new breakthrough” category needed.
This sounds more like a crux. I think it’s likely that more breakthroughs are needed, that the relevant milestones are about architectures/methods (different ways of making AGI work), and that benchmarks are only relevant to the extent that they predict a particular way of making AGI work succeeding at a given level of compute and of low-hanging-fruit incremental algorithmic improvements.
except that actually I think it IS predictable, if we think that the current paradigm will massively accelerate AI R&D in general
In my model gated by “breakthroughs”, accelerating incremental algorithmic improvements or surrounding engineering doesn’t particularly help, because it merely picks the low-hanging incremental algorithmic fruit faster (which is in limited supply for a given level of compute and with given methods, but currently takes human researchers years to pick). At the point where even “breakthroughs” are accelerated a lot, AI capable of full automation of civilization is probably already available.
Like, maybe continual learning? And that’s about it? Not even sure we need it tbh.
So the ways of making AGI work that I see are roughly:
(1) Just LLMs with pretraining, giving superintelligent in-context learning, making AIs able to figure out how to work around their hobblings using just in-context reasoning. This clearly doesn’t work at 2024 levels of compute (which only got to be observed in 2025), and after 2026 levels of compute natural text data starts running out. Maybe there are some relevant sparks at 2029-2031 levels of compute (5 GW), but probably not sufficient on its own.
(2) LLMs RLVRed on specific tasks and RL environments that are manually constructed. The IMO results show this is sufficient for mildly superhuman capabilities, especially with a 2026 level of pretraining compute or model size, but the RL-level capabilities remain jagged and don’t automatically generalize to arbitrary tasks and situations. Possibly RLVRing LLMs on the capability to RLVR LLMs and to construct tasks and RL environments could work around this limitation, automating development of mildly superhuman capabilities for any given topic. This possibly gives a way to 2026-2027 AGI, though I think it’s a long shot on its own. The relevant benchmarks are various automated ML engineering stuff relevant to automation of RLVRing a model on a new topic. If such benchmarks saturate and this still doesn’t work, this architecture/method is mostly ruled out, I think by about the end of 2027. Or it does work.
(3) Continual learning doesn’t seem directly relevant, but maybe it helps with teaching LLMs to RLVR LLMs. It doesn’t seem directly relevant because it probably doesn’t directly train mildly superhuman capabilities. But since the generalizing glue of adaptation would make everything work better, it might enable method (2) to go forward where it would fail on its own. The relevant benchmarks might be the same as for (2), but only if they don’t already saturate by the time there’s useful continual learning (in some nontrivial sense of the term).
I think continual learning is more likely to end up important in its impact on AI company TAM. If it manages to 10x TAM, then it will thereby 10x the compute, though it’ll probably take until 2030-2035 to actually build the kind of research/training compute (20-50 GW) that a trillion dollars of revenues buys (even if 2026 already shows that this is on track to happen). (Rough arithmetic for the revenues-to-gigawatts conversion is sketched right after this list.)
(4) ???: Next word prediction RLVR and variations on this topic that make mildly superhuman RL-level capabilities (as opposed to pretraining-level capabilities) more general, trained with either natural text pretraining data or with continual learning data. Might need a lot more compute, even after/if something like this becomes more clearly a workable idea, so those 20-50 GW training systems might come in handy.
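(A back-of-envelope version of the revenues-to-gigawatts point in (3). Every number below, including the capex share of revenue, the buildout window, and the cost per GW, is an assumed round figure for illustration, not a sourced estimate.)

```python
# Back-of-envelope: how many GW of training compute might a trillion dollars
# of annual revenue support? All numbers are assumptions for illustration.
annual_revenue_usd = 1e12        # the "trillion dollars of revenues" above
capex_share_of_revenue = 0.5     # assumed fraction of revenue going into buildout
buildout_years = 3               # assumed buildout window
capex_per_gw_usd = 40e9          # assumed all-in cost (chips + datacenter + power) per GW

total_capex = annual_revenue_usd * capex_share_of_revenue * buildout_years
gigawatts = total_capex / capex_per_gw_usd
print(f"~{gigawatts:.0f} GW")    # ~38 GW under these assumptions, inside the 20-50 GW range
```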
I understand you don’t like benchmark-based methodology. I think you should still answer my question, because if you did have a better benchmark, it would be valuable to me, and I asked nicely. ;) But it’s OK; I think it’s clear by now that you don’t.
Thank you for explaining your model more. I disagree with some bits:
In my model gated by “breakthroughs”, accelerating incremental algorithmic improvements or surrounding engineering doesn’t particularly help, because it merely picks the low-hanging incremental algorithmic fruit faster (which is in limited supply for a given level of compute and with given methods, but currently takes human researchers years to pick). At the point where even “breakthroughs” are accelerated a lot, AI capable of full automation of civilization is probably already available.
The speedup from today’s coding agents is not just a within-paradigm speedup. If someone is trying to figure out how to do continual learning or brain-like AGI or whatever, they need to run experiments on GPUs as part of their research, and they’ll be able to do that faster with the help of Claude Code. Only slightly faster of course. But the point is, it’s not just a within-paradigm speedup. And it’ll get stronger over the next year or three, as the coding agents get massively better and become able to succeed at longer-horizon tasks. Moreover, horizon lengths seem to be going up in most (all?) domains, not just coding; this suggests that the current paradigm will eventually automate the parts of AI R&D involved with new paradigms.
Just LLMs with pretraining, giving superintelligent in-context learning, making AIs able to figure out how to work around their hobblings using just in-context reasoning. This clearly doesn’t work at 2024 levels of compute (which only got to be observed in 2025), and after 2026 levels of compute natural text data starts running out. Maybe there are some relevant sparks at 2029-2031 levels of compute (5 GW), but probably not sufficient on its own.
I want to flag that you estimated that post-2028 we’d slow down due to the MoE data wall. While MoE is more compute-efficient, I don’t think it’s so much more compute-efficient that people won’t take the hit of reduced compute efficiency if data efficiency becomes more important post-2028, and based on data limitations alone we can probably continue mostly pre-training up until the early 2030s. That said, I agree this is likely insufficient on its own, so I’m not claiming that pure LLM AGI is a likely path.
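(To put rough orders of magnitude on the data-wall discussion: a sketch assuming Chinchilla-style compute-optimal scaling, i.e. training compute C ≈ 6·N·D with tokens D ≈ 20·N. The hardware efficiency, utilization, run length, and usable-text-stock numbers are assumed round figures, not measurements.)

```python
import math

# Back-of-envelope: tokens wanted by compute-optimal pretraining on a ~5 GW cluster,
# versus a rough guess at the stock of usable natural text. All numbers are assumptions.
power_watts = 5e9                # the ~5 GW training system mentioned upthread
flops_per_watt = 1e12            # assumed: ~1e15 FLOP/s per ~1 kW all-in (H100-class BF16)
utilization = 0.4                # assumed sustained utilization
train_seconds = 90 * 24 * 3600   # assumed ~3-month training run

compute = power_watts * flops_per_watt * utilization * train_seconds  # total FLOPs

# Chinchilla-style optimum: C = 6*N*D with D = 20*N  =>  C = 120*N^2
params = math.sqrt(compute / 120)
tokens_wanted = 20 * params

usable_text_tokens = 5e13        # assumed: tens of trillions of high-quality text tokens

print(f"training compute   ~{compute:.1e} FLOP")
print(f"compute-optimal N  ~{params:.1e} params")
print(f"tokens wanted      ~{tokens_wanted:.1e}")
print(f"vs usable text     ~{usable_text_tokens:.1e} ({tokens_wanted / usable_text_tokens:.0f}x short)")
```

Under these assumptions the compute-optimal token demand overshoots the assumed text stock severalfold, which is one way to see why data efficiency (rather than raw compute efficiency) becomes the binding constraint in this scenario.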
Continual learning doesn’t seem directly relevant, but maybe it helps with teaching LLMs to RLVR LLMs. It doesn’t seem directly relevant because it probably doesn’t directly train mildly superhuman capabilities. But since the generalizing glue of adaptation would make everything work better, it might enable method (2) to go forward where it would fail on its own. The relevant benchmarks might be the same as for (2), but only if they don’t already saturate by the time there’s useful continual learning (in some nontrivial sense of the term).
I think continual learning is more likely to end up important in its impact on AI company TAM. If it manages to 10x TAM, then it will thereby 10x the compute, though it’ll probably take until 2030-2035 to actually build the kind of research/training compute (20-50 GW) that a trillion dollars of revenues buys (even if 2026 already shows that this is on track to happen).
I think this is probably true for the near term (as in 5-10 year timelines), but over a 15-20 year span with Moore’s law, I’d be more confident that continual learning alone could plausibly train in mildly superhuman capabilities, for the same reasons why certain researchers IRL are way more productive than others.
This does depend on human level or better sample efficient learning to work out, though.
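(A tiny worked calculation for the 15-20 year framing, assuming compute per dollar doubles roughly every 2 years; that cadence is an assumption, not a measured figure for AI-relevant hardware.)

```python
# How much more compute per dollar after 15-20 years at an assumed 2-year doubling time?
for years in (15, 20):
    doublings = years / 2.0
    print(f"{years} years: 2^{doublings:.1f} ≈ {2 ** doublings:,.0f}x compute per dollar")
# 15 years: ~181x; 20 years: ~1,024x under this assumption.
```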
That said, I agree with your methodology much more than Daniel Kokotajlo’s methodology on AI prediction (though I do think METR time horizons are more useful and less anchored to specific paradigms than you say they are).
That’s why I said “possibly still a few years later.”