Yeah, I wanted to include that paragraph but it didn’t fit in the screenshot. It does seem slightly redeeming for the description. Certainly the authors hedged pretty heavily.
Still, I think that people are not saving days of work by chatting with AI agents on Slack, so there's a vibe here that seems wrong. The vibe is that these agents are unreliable but offer very significant benefits. That is called into question by the METR report showing they slowed developers down. There are problems with that report, and I would love to see some follow-up work to be more certain.
I appreciate your research on the SOTA SWE-bench Verified scores! That's a concrete prediction we can evaluate (less important than real-world performance, but at least more objective). Since we're now in mid-to-late 2025 (not mid-2025), it appears that models are slightly behind their projections even for pass@k, but they were certainly in the right ballpark!
Sorry, this is the most annoying kind of nitpicking on my part, but it's probably relevant here (and for your other comment responding to Stanislav down below): the center point of the year is July 2, 2025. So we're just two weeks past the absolute midpoint, which puts us 54.4% of the way through the year.
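(For anyone who wants to check the date arithmetic, here's a minimal sketch; the exact percentage depends on which day you count from, so the figures below are illustrative rather than an exact reproduction of the 54.4% above.)

```python
from datetime import date

def year_fraction(d: date) -> float:
    """Fraction of the calendar year elapsed at the start of day d."""
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)  # handles leap years automatically
    return (d - start).days / (end - start).days

# In a 365-day year the midpoint falls at noon on July 2 (182.5 / 365 = 50%).
print(f"{year_fraction(date(2025, 7, 2)):.1%}")   # start of July 2
print(f"{year_fraction(date(2025, 7, 16)):.1%}")  # two weeks after July 2
```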
Also, the codex-1 benchmarks were released on May 16, while Claude 4's were announced on May 22 (both certainly before the midpoint).