I think it’s noteworthy that Claude Code and Cursor don’t advertise things like SWEBench scores while LLM releases do. I think whether commercial scaffolds make the capability evals more spooky depends a lot on what the endpoint is. If your endpoint is something like SWEBench, then I don’t think the scaffold matters much.
Maybe your claim is that if the developers of commercial scaffolds wanted to maximize SWEBench scores, better scaffolding would get them massively better SWEBench performance (e.g. a bigger jump than the Sonnet 3.5 - Sonnet 4 gap)? I doubt it, and I would guess most people familiar with scaffolding (including at Cursor) would agree with me.
In the case of the Metr experiments, the endpoints are very close to SWEBench, so I don’t think the scaffold matters. For the AI village, I don’t have a strong take, but I don’t think it’s obvious that using the best scaffold that will be available in 3 years would make a bigger difference than waiting 6 months for a better model.
Also note that Metr put a lot of effort into hill-climbing on their own benchmarks, and according to their eval they did outcompete many of the commercial scaffolds that existed at the time of their blog post (though some of that is because those commercial scaffolds were strictly worse than extremely basic scaffolds at the sort of evals Metr was looking at). The fact that they got low returns on their labor compared to “wait 6 months for labs to do better post-training” is some evidence that they would not have gotten drastically different results with the best scaffold that will be available in 3 years.
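(For concreteness, by “extremely basic scaffold” I have in mind something roughly like the sketch below: a bare loop that lets the model run shell commands and see the output. The helper names here are hypothetical placeholders, not Metr’s or any lab’s actual harness.)

```python
# A bare-bones agent scaffold: the model proposes one shell command per turn,
# the scaffold runs it and feeds the output back, until the model says DONE.
# `call_model` is a hypothetical placeholder for whatever chat API you use.
import subprocess


def call_model(messages: list[dict]) -> str:
    """Hypothetical stand-in: send the conversation to an LLM, return its reply."""
    raise NotImplementedError("plug in a model provider here")


def run_basic_scaffold(task: str, max_steps: int = 20) -> list[dict]:
    messages = [
        {"role": "system", "content": "Reply with exactly one shell command per turn, or DONE when finished."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = call_model(messages).strip()
        messages.append({"role": "assistant", "content": reply})
        if reply == "DONE":
            break
        # Execute the proposed command and return its stdout/stderr to the model.
        result = subprocess.run(
            reply, shell=True, capture_output=True, text=True, timeout=120
        )
        messages.append({"role": "user", "content": result.stdout + result.stderr})
    return messages
```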
(I am more sympathetic to the risk of post-training overhang—and in fact the rise of reasoning models was an update that there was post-training overhang. Unclear how much post-training overhang remains.)