Good response. A few things I do want to stress:

> I am just not sure I believe 25%-33% behind is significant.
I personally see the lower bound as 33% slower. That's enough to push timelines out by 2 to 3 years (33% slower means everything takes roughly 1.5x as long, so a multi-year forecast stretches by years, not months), which is significant.
And again, realistically progress is even slower. The parallel-compute version only improved by 1.8 percentage points in 4 months. At current rates we might be another 6 months from hitting 85%, which is quite a prediction gap.
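To spell out the extrapolation (assuming the recent pace is the right trend to project forward): 1.8 points over 4 months is about 0.45 points per month, so closing a roughly 2.7-point gap to 85% takes about 2.7 / 0.45 ≈ 6 months.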
> and knowledgeable human performance on the benchmark remains around 70%.
Is this true? They haven't updated their abstract, which still claims 72.36% (a number from the old version), and I'm wondering if they simply haven't re-evaluated.
But yes, looking at the GTA1 paper, you are correct that performance varies a bit between OSWorld and OSWorld-Verified, so I take back the claim that growth is obviously slower than projected.
All said, I trust SWE-bench Verified more for tracking progress regardless:

- We're relying on a well-made benchmark that was built as a second pass by OpenAI; OSWorld is not that.
- Labs seem to be targeting it more, and low-hanging fruit like attaching Python interpreters just doesn't exist for this benchmark (I'm not sure if the AI 2027 authors considered this issue when making their OSWorld predictions).
- We are mainly concerned with coding abilities (automated AI research) on the AI 2027 timelines.