Because there is a reason why Cursor and Claude Code exist. I’d suggest looking at what they do for more details.
METR is not in the business of building code agents. Why is their work informing so much of your views on the usefulness of Cursor or Claude Code?
This is literally the point I make above.
Either you fail to capture the relevant capabilities and build unwarranted confidence that things are ok, or you are doing public competitive elicitation & amplification work.
I think it’s noteworthy that Claude Code and Cursor don’t advertise things like SWEBench scores while LLM releases do. I think whether commercial scaffolds make the capability evals more spooky depends a lot on what the endpoint is. If your endpoint is something like SWEBench, then I don’t think the scaffold matters much.
Maybe your claim is that if commercial scaffolds wanted to maximize SWEBench scores with better scaffolding, they would get massively better SWEBench performance (e.g. bigger than the Sonnet 3.5 - Sonnet 4 gap)? I doubt it, and I would guess most people familiar with scaffolding (including at Cursor) would agree with me.
In the case of the METR experiments, the endpoints are very close to SWEBench, so I don’t think the scaffold matters. For the AI village, I don’t have a strong take, but I don’t think it’s obvious that using the best scaffold that will be available in 3 years would make a bigger difference than waiting 6mo for a better model.
Also note that METR put a lot of effort into hill-climbing on their own benchmarks, and according to their eval they did outcompete many of the commercial scaffolds that existed at the time of their blogpost (though some of that is because the commercial scaffolds available then were strictly worse than extremely basic scaffolds at the sort of evals METR was looking at). The fact that they got low returns on their labor compared to “wait 6mo for labs to do better post-training” is some evidence that they would not have gotten drastically different results with the best scaffold that will be available in 3 years.
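To give a sense of what “extremely basic scaffold” means here, it is roughly a system prompt, a loop, and a shell. The sketch below is illustrative only, not METR’s actual harness; `llm_complete` is a stand-in for whatever model API you’d plug in.

```python
import subprocess

def llm_complete(messages):
    """Placeholder for a call to whatever LLM API is in use (not a real library)."""
    raise NotImplementedError("plug in your model API here")

def basic_agent(task: str, max_steps: int = 30) -> str:
    # The entire "scaffold": a system prompt, a loop, and a shell.
    messages = [
        {"role": "system", "content": (
            "You are solving a software task. Reply with either "
            "CMD: <shell command> to run a command, or DONE: <answer> when finished."
        )},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = llm_complete(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        if reply.startswith("CMD:"):
            cmd = reply[len("CMD:"):].strip()
            result = subprocess.run(
                cmd, shell=True, capture_output=True, text=True, timeout=120
            )
            # Feed stdout/stderr back so the model can react to errors.
            messages.append({"role": "user",
                             "content": result.stdout + result.stderr})
        else:
            messages.append({"role": "user",
                             "content": "Reply with CMD: ... or DONE: ..."})
    return "step limit reached"
```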
(I am more sympathetic to the risk of post-training overhang—and in fact the rise of reasoning models was an update that there was post-training overhang. Unclear how much post-training overhang remains.)