OSWorld likely has serious grading issues because it grades against dynamically changing websites using static assumptions. It might saturate around 70%, and this will likely only get worse over time.
It would have been ideal for the AI 2027 authors to have defined what they believed SOTA on SWE-bench Verified was at the time of writing. As you note, on the swebench.com leaderboard it was 62.4%-65% (basically Claude 3.7 Sonnet’s pass@1), while Claude 3.7 Sonnet with its parallel compute / learned scoring function was already at 70.3%. A gain of about 15 percentage points vs. 22.6 percentage points in 4 months is quite a different projection. And today we’re at either 73% (Codex or Sonnet, 1 shot), 75% (unchecked leaderboard), or 80.2% (Sonnet 4 with parallel compute).
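To make the baseline dependence concrete, here is a quick sketch of the arithmetic, assuming the 85% target and using only the scores quoted above. The dictionary keys and the `gap_to_target` helper are names I made up for illustration.

```python
# Percent-resolved figures quoted in the comment above; the labels are
# illustrative names, not official leaderboard categories.
TARGET = 85.0
baselines = {
    "leaderboard_low": 62.4,
    "leaderboard_high": 65.0,
    "parallel_compute": 70.3,
}


def gap_to_target(baseline: float, target: float = TARGET) -> float:
    """Return the percentage points still needed to reach the target."""
    return round(target - baseline, 1)


gaps = {name: gap_to_target(score) for name, score in baselines.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} points to go")
```

Depending on which baseline the authors had in mind, the implied gain differs by roughly 8 points.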
As noted elsewhere, the trendline from Sonnet models puts us at ~84% on August 31. So not much of a miss, but again this hinges on a top agentic model being released around the end of August. (The argument against this, though, is that the public leaderboard trend is showing a slowdown.)
The SWE-bench Verified leaderboard is ~3 weeks behind submissions as far as I can tell (history suggests this too). So there’s really 2 months to get from 75.2% to 85%.
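As a rough sanity check on what that window implies, the sketch below computes the average monthly gain needed. It assumes the window is exactly 2 months and that progress is linear, which is a simplification; `required_pace` is a hypothetical helper name.

```python
def required_pace(current: float, target: float, months: float) -> float:
    """Average percentage-point gain per month needed to hit the target."""
    if months <= 0:
        raise ValueError("months must be positive")
    return round((target - current) / months, 1)


# 75.2% and 85% are the figures from the comment; 2.0 months is an assumption.
pace = required_pace(current=75.2, target=85.0, months=2.0)
print(f"Needed: about {pace} points per month")
```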
That said, I am also skeptical we’ll hit 85% by the end of August (pass@1 definition): a new Claude is unlikely to ship by then, and the fact that Opus does not outperform Sonnet and Codex barely outperforms o3 suggests the remaining tasks are quite difficult.