What are your predictions for OSWorld on Dec 31 of this year? Current SOTA is 45%. Of the 73 example tasks shown on the OSWorld data explorer, the 45th percentile task takes ~27 actions to complete, and we’ve got about two 3-month periods between now and EOY, so by a naive extrapolation we’d expect tasks up to about 100 steps to be solved by EOY. That’d be about 80%.
That sounds quite high to me—and to Manifold as well, it seems. Do you endorse that prediction, or is there additional nuance to your prediction method that I’m not taking into account?
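For reference, the naive extrapolation in the question can be written out as below. This is only a sketch of the arithmetic: the one-doubling-per-3-month-period rate is an assumption read off the question's own numbers (27 actions growing to roughly 100 over two periods), and `projected_steps` is a hypothetical helper name.

```python
def projected_steps(current_steps: float, doublings: int) -> float:
    """Task length (in actions) reachable after a number of doublings."""
    return current_steps * 2 ** doublings


CURRENT_STEPS = 27       # ~45th-percentile task length, from the question
PERIODS_REMAINING = 2    # "about two 3-month periods" until EOY
# Assumption: solvable task length doubles once per 3-month period.

print(projected_steps(CURRENT_STEPS, PERIODS_REMAINING))  # 108
```

This recovers the "about 100 steps" figure. The step from 100 steps to "about 80%" depends on the task-length distribution in the data explorer, which is not modeled here.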
The most naive method would be to use the extrapolation based on the trend on OSWorld here: https://www.lesswrong.com/posts/6KcP7tEe5hgvHbrSF/metr-how-does-time-horizon-vary-across-domains. My guess is that this yields sane results.
The main delta is probably a slower doubling time.
OSWorld isn’t in machine learning or mathematics, so we don’t have much data to go on.
But what we do have suggests a ~4 month doubling time, from which we arrive at an ~8 minute 50% time horizon by EOY. Given:
> # Difficulty Split: Easy (<60s): 28.72%, Medium (60-180s): 40.11%, Hard (>180s): 30.17%
This does suggest greater than 80% by EOY, but this depends on model release cadence etc.
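A rough sketch of the doubling-time arithmetic in this answer, under stated assumptions: roughly 6 months remain until EOY (taken from the question's "two 3-month periods"), and the current horizon is back-calculated from the ~8 minute figure rather than taken from any source. `growth_factor` is a hypothetical helper name.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth over `months` given a fixed doubling time."""
    return 2 ** (months / doubling_months)


MONTHS_TO_EOY = 6        # assumption: "about two 3-month periods"
EOY_HORIZON_MIN = 8.0    # ~8 minute 50% time horizon by EOY, from the answer
HARD_THRESHOLD_S = 180   # "Hard (>180s)" boundary from the quoted split

factor = growth_factor(MONTHS_TO_EOY, doubling_months=4)
implied_current_min = EOY_HORIZON_MIN / factor  # back-calculated, not sourced

print(f"growth by EOY: {factor:.2f}x")
print(f"implied current horizon: {implied_current_min:.2f} min")
print(f"EOY horizon vs. Hard threshold: {EOY_HORIZON_MIN * 60:.0f}s vs. {HARD_THRESHOLD_S}s")
```

Under these assumptions a 4-month doubling time gives about 2.8x growth over six months, compared with 4x for the 3-month doubling implied by the question. An 8 minute (480s) horizon sits well above the 180s boundary of the Hard bucket, which is the sense in which the quoted split points to a high score. The sketch does not itself derive the 80% figure.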