I expect this to start not happening right away.
So at least we’ll see who’s right soon.
For me a specific crux is scaling laws of R1-like training: what happens when you try to do much more of it, which inputs to this process become important constraints, and how much they matter. How this works out has been extensively brandished but not yet described quantitatively. All the reproductions of long reasoning training only had one iteration on top of some pretrained model, and even o3 isn’t currently known to be based on the same pretrained model as o1.
The AI 2027 story heavily leans into RL training taking off promptly, and it’s possible they are resonating with some insider rumors grounded in reality, but from my point of view it’s too early to tell. I guess in a few months to a year there should be enough public data to tell something, but then again a quantitative model of scaling for MoE (compared to dense) was only published in Jan 2025, even though MoE was already key to the original GPT-4 trained in 2022.
They’re looking to make bets with people who disagree. Could be a good opportunity to get some expected dollars.
Sure, I’ll keep it simple (will submit through proper channels later):
Here’s my attempt to change their minds: https://www.lesswrong.com/posts/vvgND6aLjuDR6QzDF/my-model-of-what-is-going-on-with-llms
I’ll bet 100 USD that by 2027 AI agents have not replaced human AI engineers. If it’s hard to decide I’ll pay 50 USD.
This seems a pretty big backpedal from “I expect this to start not happening right away.”
If you click through the link from @kave, you’ll see the authors are prioritizing bets with clear resolution criteria. That’s why I chose the statement I made—it’ll initially be hard to tell whether AI agents are more or less useless than this essay proposes they will be.
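I mean it’s not like they shy away from concrete predictions. E.g. their first prediction is
“We forecast that mid-2025 agents will score 85% on SWEBench-Verified.”
Edit: oh wait, never mind, their first prediction is actually
“Specifically, we forecast that they score 65% on the OSWorld benchmark of basic computer tasks (compared to 38% for Operator and 70% for a typical skilled non-expert human).”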
Yeah, I guess that the early statements I disagree with at a glance are less specific, and later there are very specific claims I disagree with.
I can see how this would seem incongruous with my initial comment.
Can you give something specific? It seems like pretty much every statement has a footnote grounding the relevant high-level claim in low-level indicators, and where that’s not the case, the predictions often seem to be clear derivatives of precise claims in e.g. their compute forecast.
I’m not saying there are no precise claims about the near future, only that I haven’t made up my mind about those precise claims. For instance, my only active disagreement with the mid-2025 section is that it gives me the impression that LLM agents will be seeing more widespread use than I expect. There are specific claims, like a prediction about SWE bench performance, but I don’t trust SWE bench as a measure of progress towards AGI, and I can’t determine at a glance whether their number is too high or too low.
The later sections are full of predictions that I expect to fail indisputably. The most salient is that AI engineers are supposed to be obsolete in like 2 years.
For what it’s worth, my view is that we’re very likely to be wrong about the specific details in both of the endings, since they are obviously super conjunctive. I don’t think that there’s any way around this because we can be confident AGI is going to cause some ex-ante surprising things to happen.
Also, this scenario is around 20th percentile timelines for me; my median is early 2030s (though other authors disagree with me). I also feel much more confident about the pre-2027 scenario than about the post-2027 scenario.
Is your disagreement that you think AGI will happen later, or that you think the effects of AGI on the world will look very different, or both? If it’s just the timelines, we might have fairly similar views.
My main disagreement is the speed, but not because I expect everything to happen more slowly by some constant factor. Instead I think there’s a missing mood here regarding the obstacles to building AGI, and the time to overcome those obstacles is not clear (which is why my timeline uncertainty is still ~in the exponent).
In particular, I think the first serious departure from my model of LLMs (linked above) is the neuralese section. It seems to me that making this really work (in a way comparable to how human brains have recurrence) would require another breakthrough at least on the level of transformers, if not harder. So, if the paper from Hao et al. is actually followed up on by future research that successfully scales, that would be a crux for me. Your explanation that the frontier labs haven’t adopted this for GPU utilization reasons seems highly implausible to me. These are creative people who want to reach AGI, and it seems obvious that the kinds of tasks that aren’t conquered yet look a lot like the ones that need recurrence. Do you really think none of them have significantly invested in this (starting years ago, when it became obvious this was a bottleneck)? The fact that we still need CoT at all tells me neuralese is not happening because we don’t know how to do it. Please refer to my post for more details on this intuition and its implications. In particular, I am not convinced this is the final bottleneck.
I also depart from certain other details later; for instance, I think we’ll have better theory by the time we need to align human-level AI, and “muddling through” by blind experimentation probably won’t work or be the actual path taken by surviving worlds.
My other points of disagreement seem less cruxy and are mostly downstream.
Re the recurrence/memory aspect, you might like this new paper, which actually figured out how to use recurrent architectures to make a one-minute Tom and Jerry cartoon video that was reasonably consistent. The tweet below argues that they somehow managed to fix the training problems that come from training vanilla RNNs:
https://test-time-training.github.io/video-dit/assets/ttt_cvpr_2025.pdf
https://arxiv.org/abs/2407.04620
https://x.com/karansdalal/status/1810377853105828092 (This is the tweet I pointed to for the claim that they solved the issue of training vanilla RNNs):
https://x.com/karansdalal/status/1909312851795411093 (Previous work that is relevant)
https://x.com/karansdalal/status/1909312851795411093 (Tweet of the current paper)
A note is that I actually expect AI progress to slow down for at least a year, and potentially up to 4-5 years, due to the tariffs inducing a recession, but this doesn’t matter for the debate on whether LLMs can get to AGI.
I agree with the view that recurrence/hidden states would be a game-changer if they worked, because they would allow the LLM to have a memory, and memoryless humans are way, way less employable than people who have memory, because it’s much easier to meta-learn strategies with memory.
That said, I’m both uncertain on the view that recurrence is necessary to get LLMs to learn better/have a memory/state that lasts beyond the context window, and also think that meta-learning over long periods/having a memory is probably the only hard bottleneck at this point that might not be solved (but is likely to be solved, if these new papers are anything to go by).
I basically agree with @gwern’s explanation of what LLMs are missing that makes them not AGIs (at least without a further couple of OOMs at the very least, and the worst case is they need exponential compute to get linear gains):
https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/?commentId=hSkQG2N8rkKXosLEF
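I think at most one intervention is basically necessary, and one could argue that 0 new insights are needed.
The other part here is that I basically disagree with this assumption, and more generally I have a strong prior that a lot of problems are solved by muddling through/using semi-dumb strategies that work way better than they have any right to:
“I also depart from certain other details later; for instance, I think we’ll have better theory by the time we need to align human-level AI, and ‘muddling through’ by blind experimentation probably won’t work or be the actual path taken by surviving worlds.”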
I think most worlds that survive AGI to ASI for at least 2 years, if not longer, will almost certainly include a lot of dropped balls and fairly blind experimentation (helped out by the AI control agenda), as well as the world’s offense-defense balance shifting to a more defensive equilibrium.
I do think most of my probability mass for AI that can automate all AI research is in the 2030s, but this is broadly due to the tariffs and scaling up new innovations taking some time, rather than the difficulty of AGI being high.
Edit: @Vladimir_Nesov has convinced me that the tariffs delay stuff only slightly, though my issue is with the tariffs causing an economic recession, causing AI investment to fall quite a bit for a while.
Without AGI, scaling of hardware runs into the financial ~$200bn individual training system cost wall in 2027-2029. Any tribulations on the way (or conversely efforts to pool heterogeneous and geographically distributed compute) only delay that point slightly (when compared to the current pace of increase in funding), and you end up in approximately the same place, slowing down to the speed of advancement in FLOP/s per watt (or per dollar). Without transformative AI, anything close to the current pace is unlikely to last into the 2030s.
Thanks. I’ve submitted my own post on the ‘change our mind form’, though I’m not expecting a bounty. I’d instead be interested in making a much bigger bet (bigger than Cole’s 100 USD), gonna think about what resolution criterion is best.
Can you please sketch a scenario, in as much detail as you can afford, about how you think the next year or three will go? That way we can judge whether reality was closer to AI-2027 or to your scenario. (If you don’t do this, then when inevitably AI-2027 gets some things wrong and some things right, it’ll be hard to judge if you or I were right and confirmation bias will tempt us both.)
Sure, I suppose that now I’ve started recklessly speculating about the future I might as well follow through.
I expect the departure to be pretty clear though, because we won’t see superhuman AI engineers before 2030. Even that prediction needs to be operationalized a bit, of course.
Great, thanks! You are off to a good start, since I’m predicting superhuman autonomous AI coders by 2030 (and in fact, I’d say 50% by mid-2028 nowadays) whereas you are predicting that won’t happen. Good crux. Got any other disagreements, ideally ones that would be resolved prior to 2027? E.g. do you think that the best version of METR’s agentic coding horizon length benchmark that exists a year from now will show a plateauing of horizon lengths instead of e.g. at least a 4x improvement over today’s SOTA?
FWIW, that’s not a crux for me. I can totally see METR’s agency-horizon trend continuing, such that 21 months later, the SOTA model beats METR’s 8-hour tests. What I expect is that this won’t transfer to real-world performance: you wouldn’t be able to plop that model into a software engineer’s chair, prompt it with the information in the engineer’s workstation, and get one workday’s worth of output from it.
At least, not reliably and not in the general-coding setting. It’s possible this sort of performance would be achieved in some narrow domains, and that this would happen once in a while on any task. (Indeed, I think that’s already the case?) And I do expect nonzero extension of general-purpose real-world agency horizons. But what I expect is slower growth, with the real-world performance increasingly lagging behind the performance on the agency-horizon benchmark.
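A minimal sketch of the arithmetic behind "21 months later, the SOTA model beats METR's 8-hour tests", assuming the horizon follows a clean exponential trend. The starting horizon of about 1 hour and the 7-month doubling time are illustrative assumptions, not figures stated in this thread, and the function name is hypothetical.

```python
def projected_horizon_hours(current_hours: float,
                            months_ahead: float,
                            doubling_months: float) -> float:
    """Task horizon after `months_ahead` months, assuming it doubles
    every `doubling_months` months (a pure exponential trend)."""
    return current_hours * 2 ** (months_ahead / doubling_months)


# Assumed inputs (not from the thread): ~1-hour horizon today, 7-month doubling.
horizon = projected_horizon_hours(current_hours=1.0,
                                  months_ahead=21,
                                  doubling_months=7)
print(f"Projected horizon after 21 months: {horizon:.1f} hours")
```

Under those assumptions, 21 months is three doublings, which gives an 8-hour horizon. The comment's actual claim is about something this extrapolation does not capture: whether benchmark horizons transfer to real-world work.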
Yes. Though I find it a bit hard to visualize a 4-hour software engineering task that can’t be done in 1 hour, so I’m more clear on there not being a 16x or so improvement in 2 years.
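To compare the growth figures mentioned in this exchange ("at least a 4x improvement" in a year versus "a 16x or so improvement in 2 years"), it can help to convert each into an implied doubling time. This is a rough sketch assuming constant exponential growth over the period; the helper name is hypothetical.

```python
import math


def implied_doubling_months(growth_factor: float, period_months: float) -> float:
    """Doubling time implied by growing `growth_factor`-fold over
    `period_months`, assuming constant exponential growth."""
    return period_months * math.log(2) / math.log(growth_factor)


# The two growth figures quoted in the thread.
for factor, months in [(4, 12), (16, 24)]:
    doubling = implied_doubling_months(factor, months)
    print(f"{factor}x in {months} months -> doubling every {doubling:.1f} months")
```

Both figures correspond to the same underlying rate, a doubling roughly every 6 months, so the 16x-in-2-years statement is the 4x-per-year question extended over a longer window.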
OK, great. Wow, that was easy. We totally drilled down to the crux pretty fast. I agree that if agentic coding horizon lengths falter (failing to keep up with the METR trend) then my timelines will lengthen significantly.
Similarly, if the METR trend continues I will become very worried that AGI is near.
So far, METR seems to believe horizons are growing even faster than expected: https://metr.github.io/autonomy-evals-guide/openai-o3-report/
Though I didn’t predict the trend would break down this early, this does provide some evidence it may hold up.
Still, I admit I’m a little confused by the report regarding o3/o4-mini. Here is the task performance:
And here are the projected horizons:
To me, the first plot doesn’t look like it shows a lot of improvement. Visually, o3 seems to perform about as well as o1-preview. Its average performance is actually lowest. Am I just being data-illiterate? Why is there such a large factor difference on the second plot? o4-mini seems to show significant improvement but only because of the kernel optimization task. Is it possible OpenAI finetuned on kernel optimization to game this benchmark? I think I would need to see more robust across-the-board improvement to be convinced.
I think o3 maybe does worse on RE-bench than 3.7 Sonnet due to often attempting reward hacking. It could also be noise, since it is just a small number of tasks. (Presumably these reward hacks would have worked better in the OpenAI training setup but METR filters them out?) It doesn’t attempt reward hacking as much / as aggressively on the rest of METR’s tasks, so it does better there and this pulls up the overall horizon length. (I think.)
There is a very cynical take which I don’t endorse but can’t quite dismiss: the idea that o3 is “misaligned” and lies to users or tries to hack the rewards is easier for OpenAI to spin—it still sounds like their models are on the path to AGI, they’re just smart enough to be getting dangerous now. Maybe that’s a narrative that they want. It certainly sounds better than “actually reasoning models are still useless because they make stuff up and can’t be trained to track reality while performing multiple-step tasks.”
Am I being paranoid here, or does it seem suspicious that o3 does such blatant reward hacking? Is that really something they couldn’t RLHF out? Or does it intentionally make the models look smart-but-unaligned instead of dumb and confused?