I don’t think there’s enough evidence to draw hard conclusions about this section’s accuracy in either direction, but I would err on the side of thinking ai-2027’s description is correct.
Footnote 10, visible in your screenshot, reads:
For example, we think coding agents will move towards functioning like Devin. We forecast that mid-2025 agents will score 85% on SWEBench-Verified.
SOTA models score at:
• 83.86% (codex-1, pass@8)
• 80.2% (Sonnet 4, pass@several, unclear how many)
• 79.4% (Opus 4, pass@several)
(Is it fair to allow pass@k? This Manifold Market doesn’t allow it for its own resolution, but here I think it’s okay, given that the footnote above makes claims about ‘coding agents’, which presumably allow iteration at test time.)
Also, note the following paragraph immediately after your screenshot:
The agents are impressive in theory (and in cherry-picked examples), but in practice unreliable. AI twitter is full of stories about tasks bungled in some particularly hilarious way. The better agents are also expensive; you get what you pay for, and the best performance costs hundreds of dollars a month.11 Still, many companies find ways to fit AI agents into their workflows.12
AI twitter sure is full of both impressive cherry-picked examples and stories about bungled tasks. I also agree that the claim about “find[ing] ways to fit AI agents into their workflows” is exceedingly weak. But it’s certainly happening: a quick Google for “AI agent integration” turns up this article from IBM, where agents are diffusing across multiple levels of the company.
If I understand correctly, Claude’s pass@X benchmarks mean sampling multiple times and taking the best result. That’s valid so long as the compute cost doesn’t exceed the equivalent cost of an engineer.
Codex’s pass@8 score seems to be saying “the correct solution was present in one of the 8 attempts, but the model doesn’t actually know which one is correct”. That shouldn’t count.
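For concreteness, pass@k scores a task as solved if any one of the k sampled solutions passes the hidden tests; there is no requirement that the model identify the winning attempt, since the grader does that. Below is a minimal sketch of the standard unbiased estimator from Chen et al. (2021); it illustrates what the metric measures, not necessarily the exact harness behind the codex-1 or Claude 4 numbers above.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generated samples (c of which are correct) solves the task."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any draw of k contains a correct one
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# One task, 8 samples generated, 2 of which pass the tests:
print(pass_at_k(n=8, c=2, k=8))  # 1.0  -> counts as solved under pass@8
print(pass_at_k(n=8, c=2, k=1))  # 0.25 -> expected pass@1 for the same task
```

The objection above is about exactly this: under pass@k the evaluation harness, not the model, effectively picks out the successful attempt.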
Why do I see no higher than about 75% here?
https://www.swebench.com
Yeah, I wanted to include that paragraph but it didn’t fit in the screenshot. It does seem slightly redeeming for the description. Certainly the authors hedged pretty heavily.
Still, I think that people are not saving days by chatting with AI agents on Slack, so there’s a vibe here that seems wrong. The vibe is that these agents are unreliable but are offering very significant benefits. That is called into question by the METR report showing they slowed developers down. There are problems with that report, and I would love to see some follow-up work to be more certain.
I appreciate your research on the SOTA SWEBench-Verified scores! That’s a concrete prediction we can evaluate (less important than real-world performance, but at least more objective). Since we’re now in mid-to-late 2025 (not mid-2025), it appears that models are slightly behind the projected 85% even with pass@k, but they were certainly in the right ballpark!
Sorry, this is the most annoying kind of nitpicking on my part, but since I guess it’s probably relevant here (and for your other comment responding to Stanislav down below), the center point of the year is July 2, 2025. So we’re just two weeks past the absolute mid-point – that’s 54.4% of the way through the year.
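A quick check of that arithmetic, as a sketch (the comment date isn’t stated in the thread; the 54.4% figure corresponds to roughly July 17–18, 2025):

```python
from datetime import date

def frac_of_year(d: date) -> float:
    """Fraction of a 365-day year elapsed by the end of day d."""
    return d.timetuple().tm_yday / 365

print(f"{frac_of_year(date(2025, 7, 2)):.1%}")   # 50.1% -> the year's midpoint
print(f"{frac_of_year(date(2025, 7, 17)):.1%}")  # 54.2%
print(f"{frac_of_year(date(2025, 7, 18)):.1%}")  # 54.5%
```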
Also, the codex-1 benchmark results were released on May 16, while Claude 4’s were announced on May 22 (certainly before the midpoint).