As I understand it, the official SWE-bench Verified leaderboard runs every model with the same fixed resources and scaffolding, but when a company like Anthropic or OpenAI releases its own SWE-bench Verified scores, it uses its own infrastructure, which presumably performs better. There was already some discussion elsewhere in the comments about whether the Claude 4.5 Sonnet score I gave should even count, given that it used parallel test-time compute. I justified my decision to include the score like this:
It is my view that it counts. My sense is that benchmarks like this measure capability, not cost. Cost is never a one-to-one comparison between these models anyway, and before this year, no matter how much your model cost, you could not achieve the results that parallel compute achieves. So that is why I included that score.
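(For anyone unfamiliar with the term: parallel test-time compute means sampling several independent attempts at each task and selecting one, i.e. spending more inference compute rather than swapping in a better model. I'm not claiming this is exactly Anthropic's setup, the selection details of which I won't speculate on here; this is just a minimal best-of-N sketch, where `run_attempt` and `score_attempt` are hypothetical stand-ins for a full agent rollout and a verifier:)

```python
import concurrent.futures

def run_attempt(task: str, seed: int) -> str:
    """Hypothetical stand-in for one full agent rollout
    (e.g. an LLM agent producing a candidate patch for a SWE-bench task)."""
    return f"candidate patch for {task!r}, attempt {seed}"

def score_attempt(patch: str) -> float:
    """Hypothetical verifier, e.g. the fraction of the repo's test suite
    the candidate patch passes. Stubbed with a string hash here."""
    return (hash(patch) % 1000) / 1000

def best_of_n(task: str, n: int = 8) -> str:
    """Run n independent attempts in parallel and keep the best-scoring one.
    More attempts cost more compute but require no better underlying model."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: run_attempt(task, s), range(n)))
    return max(candidates, key=score_attempt)

print(best_of_n("fix issue #1234"))
```

(Note that n scales cost roughly linearly, which is the cost objection; the capability point above is that before this year, no amount of extra samples got you these scores.)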
Thank you. You make a good case for including this as evidence that capabilities are increasing. I suppose the question is whether they are increasing at the rate needed for short timelines. I think it’s worth asking whether the same-infrastructure performance showing zero improvement in four months is something that would have been expected four months ago. Of course, this is only one metric, over a short timeframe.
Yeah, I definitely think the improvements on OSWorld are much more impressive than the improvements on SWE-bench Verified. I also think same-infrastructure performance is a bit misleading, in the sense that when we get superintelligence, it is very unlikely to be running on the infrastructure we use today. We should expect infrastructure changes to produce improvements, I think!