Now that GPT-5 has been released and we have details about Grok's failure, we can start the re-assessment.
GPT-5 reached a 50% time horizon of 2h17m, which seems like excellent news. However, excluding spurious failures would bring GPT-5's performance to 2h41m, which aligns with Greenblatt's prediction. Moreover, the METR evaluators themselves think that “GPT-5 could have benefitted from a larger token budget”, implying that the benchmark is beginning to degrade. What other relevant metrics are there?
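To make the 2h17m vs. 2h41m gap concrete, here is a minimal sketch, in the spirit of METR's time-horizon methodology but not their actual code or data, of how a 50% horizon falls out of a logistic fit of success against human task length, and how dropping runs flagged as spurious failures shifts it. All run data below is invented for illustration.

```python
# Minimal sketch, NOT METR's code or data: fit success probability against
# log2(human task time) and read off the time where the curve crosses 50%.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical runs: (human completion time in minutes, did the model succeed?)
runs = [(0.05, 1), (0.5, 1), (2, 1), (8, 1), (15, 0), (30, 1),
        (60, 1), (120, 0), (150, 1), (240, 0), (480, 0)]
X = np.log2([t for t, _ in runs]).reshape(-1, 1)
y = np.array([s for _, s in runs])

def horizon(X, y, p=0.5):
    """Task length (minutes) at which the fitted success probability equals p."""
    clf = LogisticRegression().fit(X, y)
    b0, b1 = clf.intercept_[0], clf.coef_[0][0]
    return 2 ** ((np.log(p / (1 - p)) - b0) / b1)

print(f"50% horizon, all runs kept:        {horizon(X, y):.0f} min")

# Suppose the failure on the 120-minute task was a spurious harness error;
# dropping that run and refitting raises the estimated horizon.
keep = [i for i, (t, s) in enumerate(runs) if not (t == 120 and s == 0)]
print(f"50% horizon, spurious run dropped: {horizon(X[keep], y[keep]):.0f} min")
```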
The AI-2027 forecast has mid-2025 agents reaching 85% on SWE-bench Verified and 65% on the OSWorld benchmark.
OSWorld reached 60% on August 4 if we use no filters. SWE-bench with a minimal agent has Claude Opus 4 (20250514) reaching 67.6% when evaluated in August. Moreover, as of August 7 the only models that SWE-bench had evaluated after July 1 were Claude Opus 4 and two Chinese models. In June, SWE-bench Verified reached 75% with TRAE, and TRAE now claims to use Grok 4 and Kimi K2.
Grok 4 managed to fail on tasks that take humans 2-4 seconds(!!) and 2-4 minutes, and had a fiasco on tasks 2-4 hours long. Page 22 of the METR paper could imply that the dataset contains few tasks in the 2-4 hour range. If failures on tasks worth 2-4 seconds, minutes, or hours “sandbagged” Grok's 80% time horizon down to 15 minutes, then the metric underestimates Grok's true capabilities.
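The “sandbagging” mechanism is easy to see in the same kind of toy setup: a couple of failures on very short tasks flatten the fitted curve, which drags the 80% horizon down much further than the 50% horizon. Again, the runs below are invented, not Grok 4's actual METR data.

```python
# Toy sketch with invented runs (not Grok 4's METR data): compare how the
# 50% and 80% horizons react when a couple of ~3-second tasks are failed.
import numpy as np
from sklearn.linear_model import LogisticRegression

def horizons(runs, ps=(0.5, 0.8)):
    """Fit success vs. log2(human minutes) and return the horizons for each p."""
    X = np.log2([t for t, _ in runs]).reshape(-1, 1)
    y = np.array([s for _, s in runs])
    clf = LogisticRegression().fit(X, y)
    b0, b1 = clf.intercept_[0], clf.coef_[0][0]
    return {p: 2 ** ((np.log(p / (1 - p)) - b0) / b1) for p in ps}

# Baseline: clean record on short tasks, gradual decline on longer ones.
baseline = ([(0.05, 1)] * 6 + [(2, 1)] * 6 + [(15, 1), (15, 1), (15, 0)] * 2
            + [(60, 1), (60, 0)] * 3 + [(240, 0)] * 4)
# Same record on long tasks, but two of the ~3-second tasks now fail,
# flattening the fitted curve.
sandbagged = [(0.05, 0), (0.05, 0)] + baseline[2:]

for name, runs in [("baseline", baseline), ("short-task failures", sandbagged)]:
    h = horizons(runs)
    print(f"{name:>20}: 50% ≈ {h[0.5]:.1f} min, 80% ≈ {h[0.8]:.1f} min")
```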
While there are no such estimates for Gemini 2.5 Deep Think, which was released on August 1, IIRC a LessWronger claimed that the public version received a bronze medal on IMO 2025. Another LessWronger claimed that “Gemini was ahead of openai on the IMO gold. The output was more polished so presumably they achieved a gold worthy model earlier. I expect gemini’s swe bench to thus at least be ahead of OpenAI’s 75%.”
To conclude, I doubt that we still have benchmarks that can be relied upon to quickly estimate models' capabilities: SWE-bench and OSWorld are likely too slow to update, and METR has begun to fill with noise. We do still have ARC-AGI, but Grok's success there could have demonstrated the ability to game it. And that's ignoring Claude's potential improvements after Opus 4.1...
EDIT: TRAE uses unknown scaffolding. However, applying mini-SWE-agent to Claude Opus 4 (20250514) yields better results than GPT-5 achieves, implying that other benchmarks might also rise after the update from Claude Opus 4 to 4.1, and after future updates.