It looks like AI 2027 was posted on April 3rd, 2025?
In that case, August was about 4 months away, which means late September is 20-25% slower than projected, and we are still a few percentage points short. It seems reasonable to expect the scores they predicted sometime in October or November, but that would still be, say, 40-50% longer than they projected.
The authors have emphasized repeatedly that AI 2027 was and is faster than their mode (EDIT: median) scenario, which makes doing this kind of evaluation annoying, but I would have to say that things look significantly behind the specific story in that piece. The reason I am saying this is that it is a bit of an overstatement to praise their predictive accuracy on mid-2025 predictions which they made in early-mid 2025, when their predictive accuracy is off on the scale of a month or two, and their predictions for 2025 were not viewed as particularly radical or unexpected at the time as far as I remember. It seems to me that even a hardcore skeptic of AI 2027 would have been unlikely to predict a much larger error.
(I believe I did myself leave a comment like “I expect this to start not happening right away” but in follow-up conversation specified that I was not talking about 2025).
Still, I appreciate that you are checking in on the accuracy of their story with real numbers.
The authors have emphasized repeatedly that AI 2027 was and is faster than their mode scenario, which makes doing this kind of evaluation annoying,
We’ve said that it was faster than our median, not our mode. I think it was close to most of our modes at the time of publication; mostly we were at around 2027-2028.
But the evaluation itself seems useful either way, in terms of checking in on how things are going relative to the trajectory that was our best guess conditional on the AGI timelines depicted.
Small point of information—as I’ve heard Daniel (and maybe you?) explain, the ‘mode’ in your language means repeatedly thinking ‘what might happen next, what is most likely’ and sampling forward in that way.
But that’ll systematically underestimate the overall mode of a series of such nonnegative, right-skewed distributions (which would tend to look more like the median).
So I think it could be worth being a bit pedantic about how you describe this.
I was only referring to our AI timelines mode; in this case it’s defined as the most likely year in which superhuman coder arrives.
In general the concept of mode for most of the scenario decisions seems not well defined, as e.g. for non-naturally-numeric choices it depends on how you define the categories and what past events you condition on (for the timelines mode we’re conditioning on the starting point, but in other cases one might condition on all events thus far).
I would personally describe our process as some mixture of sampling what intuitively feels most likely at each point (which might e.g. correspond to the mode of a natural categorical breakdown or of a distribution conditional on all events thus far, but we mostly didn’t explicitly calculate this) and optimizing for making things not too degenerate and overall intuitively feel like a plausible trajectory (because by default taking the mode every time would look unlike what we actually expect in some sense, since in the real world there will be many surprises).
As an example of how much definitions matter here, if we just conditioned on the previous conditions for each month and sampled what big algorithmic improvements might happen treating this as a categorical variable which enumerated many possible improvements, we might never end up with any specific algorithmic improvements or end up with them quite late in the game. But if we instead assume that we think overall probably some will come before superhuman coder and then pick what we think are the most likely ones even though any individual one may be <50% this quickly (though not totally clear in this case) and <<50% in any individual month, then we end up with neuralese recurrence and shared memory bank right before SC.
Perhaps a simpler example of how categorization matters is that if we break down possible AIs’ goals very granularly, then we put the most probability on AIs being very well aligned, relative to any very specific misaligned goal. But we overall have more probability on misalignment in this scenario, so we first make that high-level choice, then we choose one of the most likely specific misaligned goals.
I want to say this more clearly and simply somehow. Something like ‘adding up a series of conditional modes does not give the overall mode’? (And for nonnegative, right-skewed things like timelines, it’ll systematically underestimate except maybe in the presence of very particular correlations?)
Here’s a try at phrasing it with less probability jargon:
The forecast contains a number of steps, all of which are assumed to take our best estimate of their most likely time. But in reality, unless we’re very lucky, some of those steps will be faster than predicted, and some will be slower. The ones that are faster can only be so much faster (because they can’t take no time at all). On the other hand, the ones that are slower can be much slower. So the net effect of this uncertainty probably adds up to a slowdown relative to the prediction.
Does that seem like a fair summary?
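A minimal simulation sketch of this point, under the illustrative assumption that each step’s duration is lognormal (this is not the AI 2027 authors’ actual model, just a toy right-skewed distribution):

```python
import numpy as np

# Illustrative only: ten sequential steps, each with a right-skewed (lognormal) duration.
rng = np.random.default_rng(0)
n_steps, n_samples = 10, 100_000
mu, sigma = 0.0, 0.8

steps = rng.lognormal(mean=mu, sigma=sigma, size=(n_samples, n_steps))
totals = steps.sum(axis=1)

sum_of_modes = n_steps * np.exp(mu - sigma**2)   # analytic mode of each lognormal step
print(f"sum of per-step modes: {sum_of_modes:.1f}")
print(f"median of the total:   {np.median(totals):.1f}")
print(f"mean of the total:     {totals.mean():.1f}")
# The "take the most likely value at every step" total comes out well below
# where the bulk of the simulated totals actually land.
```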
Generate an image randomly with each pixel black with 51% chance and white with 49% chance, independently. The most likely image? Totally black. But virtually all the probability mass is on images which are ~49% white. Adding correlations between neighbouring pixels (or, in 1D, correlations between time series events) doesn’t remove this problem, despite what you might assume.
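A quick numerical sketch of that pixel example (toy numbers, just to make the point concrete):

```python
import numpy as np

# Each pixel is black with probability 0.51, white with probability 0.49, independently.
rng = np.random.default_rng(0)
n_pixels, n_images = 10_000, 1_000
p_black = 0.51

images = rng.random((n_images, n_pixels)) < p_black   # True = black

# The single most likely image is all-black, but its probability is astronomically small...
log_p_all_black = n_pixels * np.log(p_black)
print(f"log-probability of the all-black image: {log_p_all_black:.0f}")  # about -6733

# ...and essentially every sampled image is ~49% white.
white_fraction = 1 - images.mean(axis=1)
print(f"white fraction across samples: {white_fraction.mean():.3f} "
      f"(min {white_fraction.min():.3f}, max {white_fraction.max():.3f})")
```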
The core problem is that the mode of a high-dimensional probability distribution is typically degenerate. (As an aside, this also causes problems for parameter estimation of unnormalized energy-based models, an extremely broad class, because you should sample from them to normalize; maximum-probability estimates can be dangerous.)
Statistical mechanics points to the solution: knowing the most likely microstate of a box of particles doesn’t tell you anything; physicists care about macrostates, which are observables. You define a statistic (any function of the data, which somehow summarizes it) which you actually care about, and then take the mode of that. For example, number of breakthrough discoveries by time t.
An intuition you might be able to invoke is that the procedure they describe is like greedy sampling from an LLM, which doesn’t get you the most probable completion.
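For what it’s worth, here is a toy two-step version of that greedy-sampling intuition (made-up probabilities, not an actual LLM):

```python
# Hypothetical two-step "language model": pick a first token, then a second token.
# Probabilities are chosen so that the greedy choice at each step is NOT the most
# probable completion overall.
P_FIRST = {"a": 0.40, "b": 0.35, "c": 0.25}
P_SECOND = {
    "a": {"x": 0.34, "y": 0.33, "z": 0.33},  # greedy picks "a", whose mass then splits thinly
    "b": {"x": 0.90, "y": 0.05, "z": 0.05},  # "b" concentrates almost everything on "x"
    "c": {"x": 0.50, "y": 0.25, "z": 0.25},
}

first = max(P_FIRST, key=P_FIRST.get)
greedy = (first, max(P_SECOND[first], key=P_SECOND[first].get))
best = max(
    ((f, s) for f in P_FIRST for s in P_SECOND[f]),
    key=lambda fs: P_FIRST[fs[0]] * P_SECOND[fs[0]][fs[1]],
)

print("greedy sequence:       ", greedy, P_FIRST[greedy[0]] * P_SECOND[greedy[0]][greedy[1]])  # ('a', 'x'), 0.136
print("most probable sequence:", best, P_FIRST[best[0]] * P_SECOND[best[0]][best[1]])          # ('b', 'x'), 0.315
```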
It seems to me that even a hardcore skeptic of AI 2027 would have been unlikely to predict a much larger error.
As someone who could perhaps be termed as such, my expectations regarding the technical side of things only start to significantly diverge at the start of 2027. (I’m not certain of Agent-1 1.5x’ing AI research speed, but I can see that.[1] The rest seems more or less priced-in.) And indeed, the end of 2026 is the point where, the forecast itself admits, its uncertainty increases and its predictions get less grounded.
Specifically, the point where I get off the ride is this one:
OpenBrain doubles down on this strategy with Agent-2. It is qualitatively almost as good as the top human experts at research engineering (designing and implementing experiments), and as good as the 25th percentile OpenBrain scientist at “research taste” (deciding what to study next, what experiments to run, or having inklings of potential new paradigms). While the latest Agent-1 could double the pace of OpenBrain’s algorithmic progress, Agent-2 can now triple it, and will improve further with time.
My understanding is that Agent-2 essentially “closes the loop” on automated AI R&D, and while human input is still useful due to worse taste, it’s no longer required. That’s the part that seems like a “jump” to me, not a common-sensical extrapolation, and which I mostly expect not to happen.
Because I am really confused about how much AI is accelerating research/programming right now, I have no idea what number to extrapolate. Maybe it gets so good at fooling people into thinking they’re being incredibly productive by managing 50 agents at once that it slows research down by 50% instead?
Out of my own curiosity, if the real world plays out as you anticipate and Agent-2 does not close the loop, how much does that push back your timelines? Do you think that something like Agent-3 or Agent-4 could close the loop, or do you think it is further off than even that?
I agree we’re behind the AI-2027 scenario and unlikely to see those really really fast timelines. But I’d push back on calling it ‘significantly behind.’
Here’s my reasoning: we nearly hit the August benchmarks in late September, roughly 5 months after AI-2027’s release instead of 4 months. That’s about 25% slower. If that rate difference holds constant, the ‘really crazy stuff’ that AI-2027 places around January 2027 (~21 months out) would instead happen around June 2027 (~26 months out). To me, a 5-month delay on exponential timelines isn’t drastically different. Even if you assume that we are going, say, 33% slower, we are still looking at August 2027 (~28 months out) for some really weird stuff.
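A minimal sketch of that arithmetic, assuming the April 2025 release as month zero and a constant slowdown factor applied to every predicted lead time:

```python
from datetime import date

RELEASE = date(2025, 4, 1)  # AI 2027 publication, treated as month zero

def add_months(d: date, months: int) -> date:
    """Shift a date forward by a whole number of months (day fixed to the 1st)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)

for factor in (1.25, 1.33):
    shifted = round(21 * factor)   # ~21 months out = the scenario's January 2027
    print(f"{factor}x the original lead time -> ~{add_months(RELEASE, shifted):%B %Y}")
# 1.25x -> ~June 2027; 1.33x -> ~August 2027
```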
That said, I’m uncertain whether this is the right way to think about it. If progress acceleration depends heavily on hitting specific capability thresholds at specific times (like AI research assistance enabling recursive improvement), then even small delays might compound or cause us to miss windows entirely. I’d be interested to hear if you think threshold effects like that are likely to matter here.
Personally, I am not sure I am convinced these effects will matter very much, given that in the scenario there were not supposed to be large-scale speedups to AI research until early 2026 (where they projected a fairly modest 1.5x speedup). But perhaps you have a different view?
Sonnet 4.5 was released on nearly the final day of September, which seems like 1.5 months out from a generic “August”, and a 3% score difference is not necessarily insignificant (perhaps there are diminishing returns at >80%). I agree that we are quibbling over a thing that does not in itself matter much, but it IS important for assessing their predictive accuracy, and if their predictive accuracy is poor, it does not necessarily mean all of their predictions will be slow by the same constant factor. To be clear, all of these signals are very weak. I am only (modestly) disagreeing with the positive claim of the OP.
The signal that I am waiting for to assess very short timelines is primarily METR task lengths.
Sonnet 4.5 was released on nearly the final day of September, which seems like 1.5 months out from a generic “August”
I interpret August as “by the end of August”. Probably worth figuring out which interpretation is correct, maybe the authors can clarify.
it IS important for assessing their predictive accuracy, and if their predictive accuracy is poor, it does not necessarily mean all of their predictions will be slow by the same constant factor.
Yeah, I agree with this. I do think there is pretty good evidence of predictive accuracy between the many authors, but obviously people have conflicting views on this topic.
To be clear, all of these signals are very weak. I am only (modestly) disagreeing with the positive claim of the OP.
This is a place where somebody writing a much slower timeline through, like, 2028 would be really helpful. It would be easier to assess how good a prediction this is with comparisons to other people’s timelines for achieving these metrics (65% OSWorld, 85% SWEBench-Verified). I am not aware of anybody else’s predictions about these metrics from a similar time, but that would probably be the most useful thing for resolving this.
I appreciate the constructive responses!
I am amused that we are, with perfect seriousness, discussing the dates for the singularity with a resolution of two weeks. I’m an old guy; I remember when the date for the singularity was “in the twenty first century sometime.” For 50 years, predictions have been getting sharper and sharper. The first time I saw a prediction that discussed time in terms of quarters instead of years, it took my breath away. And that was a couple of years ago now.
Of course it was clear decades ago that as the singularity approached, we would have a better and better idea of its timing and contours. It’s neat to see it happen in real life.
(I know “the singularity” is disfavored, vaguely mystical, twentieth century terminology. But I’m using it to express solidarity with my 1992 self, who thought with that word.)