I agree we’re behind the AI-2027 scenario and unlikely to see those really, really fast timelines. But I’d push back on calling it ‘significantly behind.’
Here’s my reasoning: We nearly hit the August benchmarks in late September, roughly 5 months after AI-2027’s release instead of 4 months. That’s about 25% slower. If that rate difference holds constant, the ‘really crazy stuff’ that AI-2027 places around January 2027 (~21 months out) would instead happen around June 2027 (~26 months out). To me, a 5-month delay on exponential timelines isn’t drastically different. Even if you assume we are going, say, 33% slower, we are still looking at August 2027 (~28 months out) for some really weird stuff.
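(A minimal sketch of the constant-slowdown arithmetic above, in Python. The ~21-month baseline is measured from AI-2027’s April 2025 release to its ~January 2027 milestone, as in this comment; the slowdown factors are the ones assumed here, not anything from the scenario’s authors.)

```python
# Back-of-the-envelope: if progress runs a constant factor slower than
# AI-2027 predicted, milestones stretch out proportionally.
# Assumed baseline: the "really crazy stuff" lands ~21 months after the
# scenario's release (April 2025 -> ~January 2027).

MILESTONE_MONTHS = 21

for slowdown in (1.25, 1.33):  # 25% and 33% slower, per the comment
    delayed = MILESTONE_MONTHS * slowdown
    print(f"{slowdown:.2f}x slower: ~{round(delayed)} months out "
          f"(a delay of ~{round(delayed - MILESTONE_MONTHS)} months)")

# 1.25x -> ~26 months (June 2027); 1.33x -> ~28 months (August 2027).
```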
That said, I’m uncertain whether this is the right way to think about it. If progress acceleration depends heavily on hitting specific capability thresholds at specific times (like AI research assistance enabling recursive improvement), then even small delays might compound or cause us to miss windows entirely. I’d be interested to hear if you think threshold effects like that are likely to matter here.
Personally, I am not convinced these effects will matter very much, given that in the scenario there were not supposed to be large-scale speedups to AI research until early 2026 (and even then they projected a fairly modest 1.5x speedup). But perhaps you have a different view?
Sonnet 4.5 was released nearly on the final day of September, which seems like 1.5 months out from a generic “August”, and a 3% score difference is not necessarily insignificant (perhaps there are diminishing returns at >80%). I agree that we are quibbling over a thing that does not in itself matter much, but it IS important for assessing their predictive accuracy, and if their predictive accuracy is poor, it does not necessarily mean all of their predictions will be slow by the same constant factor. To be clear, all of these signals are very weak. I am only (modestly) disagreeing with the positive claim of the OP.
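One way to make the diminishing-returns hedge concrete: compare the same 3-point gain on a log-odds scale, where each additional point gets harder near the ceiling. This is just an illustrative transform I’m assuming, not anything AI-2027 itself uses.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a benchmark score p in (0, 1)."""
    return math.log(p / (1 - p))

# The same 3-point gain is a bigger log-odds jump near the ceiling:
for lo, hi in ((0.50, 0.53), (0.80, 0.83)):
    print(f"{lo:.0%} -> {hi:.0%}: logit gain {logit(hi) - logit(lo):.3f}")

# 50% -> 53%: ~0.120; 80% -> 83%: ~0.199 (roughly 1.7x as much), so a
# 3-point shortfall above 80% can represent more missing progress than
# the raw percentages suggest.
```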
The signal that I am waiting for to assess very short timelines is primarily METR task lengths.
Sonnet 4.5 was released nearly on the final day of September, which seems like 1.5 months out from a generic “August”
I interpret August as “by the end of August”. Probably worth figuring out which interpretation is correct; maybe the authors can clarify.
it IS important for assessing their predictive accuracy, and if their predictive accuracy is poor, it does not necessarily mean all of their predictions will be slow by the same constant factor.
Yeah, I agree with this. I do think there is pretty good evidence of predictive accuracy across the scenario’s many authors, but obviously people have conflicting views on this topic.
To be clear, all of these signals are very weak. I am only (modestly) disagreeing with the positive claim of the OP.
This is a place where somebody writing a much slower timeline, through 2028 or so, would be really helpful. It would be easier to assess how good a prediction this is with comparisons to other people’s timelines for achieving these metrics (65% OSWorld, 85% SWEBench-Verified). I am not aware of anybody else’s predictions about these metrics from a similar time, but they would probably be useful for resolving this.
I appreciate the constructive responses!
I am amused that we are, with perfect seriousness, discussing the dates for the singularity with a resolution of two weeks. I’m an old guy; I remember when the date for the singularity was “in the twenty-first century sometime.” For 50 years, predictions have been getting sharper and sharper. The first time I saw a prediction that discussed time in terms of quarters instead of years, it took my breath away. And that was a couple of years ago now.
Of course it was clear decades ago that as the singularity approached, we would have a better and better idea of its timing and contours. It’s neat to see it happen in real life.
(I know “the singularity” is disfavored, vaguely mystical, twentieth-century terminology. But I’m using it to express solidarity with my 1992 self, who thought with that word.)