I don’t think there’s enough evidence to draw hard conclusions about this section’s accuracy in either direction, but I would err on the side of thinking ai-2027’s description is correct.
Footnote 10, visible in your screenshot, reads:
For example, we think coding agents will move towards functioning like Devin. We forecast that mid-2025 agents will score 85% on SWEBench-Verified.
Current SOTA models score:
• 83.86% (codex-1, pass@8)
• 80.2% (Sonnet 4, pass@several, unclear how many)
• 79.4% (Opus 4, pass@several)
(Is it fair to allow pass@k? This Manifold Market doesn’t allow it for its own resolution, but here I think it’s okay, given that the footnote above makes claims about ‘coding agents’, which presumably allow iteration at test time.)
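For what it’s worth, if “pass@k” in those numbers means the usual thing (a problem counts as solved when any of k sampled attempts passes), the standard unbiased estimator from the HumanEval paper is easy to sketch; the n, c, k values below are purely illustrative and not taken from any of the reported runs.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total attempts sampled per task
    c: number of those attempts that passed
    k: attempt budget being evaluated
    Returns the probability that at least one of k attempts,
    drawn without replacement from the n samples, passes.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: a task where 2 of 16 sampled attempts passed.
print(round(pass_at_k(n=16, c=2, k=1), 3))  # 0.125
print(round(pass_at_k(n=16, c=2, k=8), 3))  # 0.767
```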
Also, note the following paragraph immediately after your screenshot:
The agents are impressive in theory (and in cherry-picked examples), but in practice unreliable. AI twitter is full of stories about tasks bungled in some particularly hilarious way. The better agents are also expensive; you get what you pay for, and the best performance costs hundreds of dollars a month.[11] Still, many companies find ways to fit AI agents into their workflows.[12]
AI twitter sure is full of both impressive cherry-picked examples and stories about bungled tasks. I also agree that the claim about “find[ing] ways to fit AI agents into their workflows” is exceedingly weak. But it’s certainly happening: a quick Google for “AI agent integration” turns up this article from IBM, where agents are diffusing across multiple levels of the company.
Sorry, this is the most annoying kind of nitpicking on my part, but I guess it’s probably relevant here (and for your other comment responding to Stanislav down below): the midpoint of the year falls on July 2, 2025, so we’re roughly two weeks past the absolute midpoint, about 54.4% of the way through the year.
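For the curious, a minimal sketch of the date arithmetic (the mid-July date in the example is hypothetical, just to show the calculation):

```python
from datetime import datetime

def fraction_of_year_elapsed(moment: datetime) -> float:
    """Fraction of the calendar year that has elapsed at `moment`."""
    start = datetime(moment.year, 1, 1)
    end = datetime(moment.year + 1, 1, 1)
    return (moment - start) / (end - start)

# 2025 is not a leap year, so the exact midpoint falls at noon on July 2.
print(fraction_of_year_elapsed(datetime(2025, 7, 2, 12)))        # 0.5
# A hypothetical mid-July date, a couple of weeks past the midpoint:
print(round(fraction_of_year_elapsed(datetime(2025, 7, 18)), 3))  # 0.542
```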
Also, the codex-1 benchmark numbers were released on May 16, while Claude 4’s were announced on May 22 (both certainly before the midpoint).