Claude Sonnet 4.5 scored 82% on this metric as of September 29th, 2025: three percentage points below the 85% target, achieved one month late. Again, remarkably close, particularly given that in August Opus 4.1 was already scoring 80% on this benchmark.
I disagree this is close for several reasons.
It isn’t clear that the “parallel test time” number even counts.
My understanding is that these benchmarks shouldn't be counted as achieved by mechanisms that cost more in compute than a human performing the task manually, and we have no idea how many parallel attempts were sampled. They used up to 256 in their post on GPQA.
It uses an internal scoring model that might not generalize beyond the repos that SWE-bench tests.
Sonnet 3.7's 70.3% score did not exist on swebench.com at the time ai-2027 was released (the highest was 65.4%), suggesting the authors were not anchoring on that parallel-test-time number to begin with.
If parallel test time does count, the projection is not close:
A projection of +15% growth over 5 months (by the beginning of September) instead came in at +12% over 6 months. That's 33% slower growth: 2% a month versus the projected 3% a month (quick arithmetic check after this comment).
Looking more recently, the growth from May's Sonnet 4 with parallel compute to now (4 months later) has been 1.8 percentage points. At that rate, assuming linearity, 85% won't be crossed for nearly 7 months from now, which is over 60% slower than the projection.
For OSWorld, these aren't even the same benchmarks. ai-2027 referred to the original OSWorld, while the Sonnet 4.5 score of 61.4% is on OSWorld-Verified. That's a huge difference: Sonnet 3.7 scored 28% on the original OSWorld but 35.8% on OSWorld-Verified. Today's SOTA on the original OSWorld might be more like 55.6% (GTA1 w/ GPT-5), a huge miss (~46% slower).
Overall, the realized data suggests something more like an ai-2029, or even later.
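For concreteness, here is the rate arithmetic behind the "33% slower" figure, as a minimal Python sketch. The +15%-over-5-months projection and the +12%-over-6-months realized figures are the ones quoted above; the even-per-month framing is a simplification.

```python
# Back-of-the-envelope check of the "33% slower" claim.
# Inputs are the numbers quoted in the comment above; growth is treated as linear per month.

projected_gain_pts = 15     # projected SWE-bench gain (percentage points)
projected_months = 5        # projected window (release -> beginning of September)

realized_gain_pts = 12      # realized gain (percentage points)
realized_months = 6         # realized window (release -> end of September)

projected_rate = projected_gain_pts / projected_months   # 3.0 pts/month
realized_rate = realized_gain_pts / realized_months      # 2.0 pts/month

slowdown = 1 - realized_rate / projected_rate             # ~0.33

print(f"projected rate: {projected_rate:.1f} pts/month")
print(f"realized rate:  {realized_rate:.1f} pts/month")
print(f"slowdown vs projection: {slowdown:.0%}")          # -> 33%
```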
It isn’t clear that the “parallel test time” number even counts.
It is my view that it counts; my sense is that benchmarks like this measure capability, not cost. Cost is never a one-to-one comparison between these models, but before this year, no matter how much your model cost, you could not achieve the results now achieved with parallel compute. That is why I included that score.
If parallel test time does count, the projection is not close:
A projection of +15% growth over 5 months (by the beginning of September) instead came in at +12% over 6 months. That's 33% slower growth: 2% a month versus the projected 3% a month.
I wrote another comment about this general idea, but the highlights from my response are:
We nearly hit the August benchmarks in late September, roughly 5 months after AI-2027's release instead of 4 months. That's about 25% slower. If that rate difference holds constant, the ‘really crazy stuff’ that AI-2027 places around January 2027 (~21 months out) would instead happen around June 2027 (~26 months out); a rough calendar sketch of this is below. To me, a 5-month delay on exponential timelines isn't drastically different. Even if you assume we are going, say, 33% slower, we are still looking at August 2027 (~28 months out) for some really weird stuff.
With that in mind, I think it's still a fairly reasonable prediction, particularly when predicting something with exponential growth. On top of that, we don't really have alternate predictions to judge against. Nonetheless, I think you are right that this particular benchmark is behind what AI-2027 projected. I am just not sure I believe 25%-33% behind is significant.
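As a rough sketch of the "rate difference holds constant" reasoning above: treat "X% slower" as "takes X% longer" (matching the 5-months-vs-4 framing) and stretch the months-elapsed accordingly. Anchoring AI-2027's release at April 2025 is my assumption here, not something stated in this thread.

```python
# Rough sketch of the timeline-stretch reasoning above.
# "X% slower" is treated as "takes X% longer", matching the 5-months-vs-4 framing.
# ASSUMPTION: AI-2027's release is anchored at April 2025.

RELEASE_YEAR, RELEASE_MONTH = 2025, 4
MILESTONE_MONTHS_OUT = 21            # "really crazy stuff" placed around January 2027

def months_after_release(months: int) -> str:
    """Return YYYY-MM for a given number of months after the release date."""
    total = (RELEASE_MONTH - 1) + months
    return f"{RELEASE_YEAR + total // 12}-{total % 12 + 1:02d}"

for stretch in (1.25, 1.33):         # 25% and 33% longer
    shifted = round(MILESTONE_MONTHS_OUT * stretch)
    print(f"{stretch:.2f}x longer: ~{shifted} months out, "
          f"around {months_after_release(shifted)}")
# -> ~26 months (2027-06) and ~28 months (2027-08), matching the comment above.
```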
For OSWorld, these aren't even the same benchmarks. ai-2027 referred to the original OSWorld, while the Sonnet 4.5 score of 61.4% is on OSWorld-Verified. That's a huge difference: Sonnet 3.7 scored 28% on the original OSWorld but 35.8% on OSWorld-Verified.
This is an oversight on my part, and you are right to point out that this originally referred to a different benchmark. However, upon further research, I am not sure your extrapolation from this, namely that the new OSWorld-Verified is substantially easier than the old OSWorld, holds. OpenAI's Operator agent actually declined in score (from 38% originally to 31% now). And while the old test allowed 200 steps versus the new test's 100, Operator improved by only 0.1% when given 100 steps instead of 50 on OSWorld-Verified, so I don't think the step budget matters.
All of this is to say, some models' scores improved on OSWorld-Verified and some declined. The redesign into OSWorld-Verified happened because the original test had bugs, not to make a brand new test (otherwise they would still be tracking the old benchmark). OSWorld-Verified is the spiritual successor to the original OSWorld, and knowledgeable human performance on the benchmark remains around 70%. I think for all intents and purposes it is worth treating as the same benchmark, though I will definitely update my post soon to reflect that the benchmark changed since AI-2027 was written.
Finally, while researching the OSWorld benchmark, I discovered that in the past few days a new high score was achieved by Agent S3 w/ GPT-5 bBoN (N=10). The resulting score was 70%, which is human-level performance, and it was achieved on October 3rd, 2025. I will also update my post to reflect that at the very beginning of October, a higher score than was projected for August was achieved on OSWorld-Verified.
I am just not sure I believe 25%-33% behind is significant.
Good response. A few things I do want to stress:
I personally see the lower bound as 33% slower. That's enough to turn 2 years into 3, which is significant.
And again, realistically progress is even slower. The parallel-compute version only increased by 1.8% in 4 months. We might be another 6 months from hitting 85% at current rates, which is quite a prediction gap.
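A quick linear extrapolation of those numbers (the 1.8-point gain over 4 months and the 82% starting point are from this thread; linearity is, again, a strong assumption):

```python
# Linear extrapolation of the parallel-compute SWE-bench trajectory.
# Numbers taken from the thread; assumes the recent per-month rate simply continues.

current_score = 82.0        # Sonnet 4.5 with parallel test-time compute (late Sep 2025)
target_score = 85.0         # AI-2027's projected level
recent_gain = 1.8           # points gained since Sonnet 4 w/ parallel compute
recent_months = 4           # over roughly four months

rate = recent_gain / recent_months                       # ~0.45 points/month
months_to_target = (target_score - current_score) / rate

print(f"rate: {rate:.2f} pts/month")
print(f"months until {target_score:.0f}%: ~{months_to_target:.1f}")   # ~6.7 months
```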
and knowledgeable human performance on the benchmark remains around 70%.
Is this true? They haven't updated their abstract, which still claims 72.36% (a figure from the old version), and I'm wondering if they simply haven't re-evaluated.
But yes, looking at the GTA1 paper, you are correct that performance varies a bit between OSWorld and OSWorld-Verified, so I take back the claim that growth is obviously slower than projected.
All said, I trust SWE-bench Verified more for tracking progress regardless:
We're relying on a well-made benchmark that was given a second pass by OpenAI; OSWorld is not that.
Labs seem to be targeting it more, and low-hanging fruit like attaching Python interpreters just doesn't exist for this benchmark (I'm not sure whether the ai-2027 authors considered this issue when making their OSWorld predictions).
We are mainly concerned with coding abilities (automated AI research) on the ai-2027 timelines.