From my point of view, both AI Village and METR, on top of not doing the straightforward thing of advocating for a pause, are bad on their own terms.
Either you fail to capture the relevant capabilities and build unwarranted confidence that things are ok, or you are doing public competitive elicitation & amplification work.
Curious what makes you think this. My impression of (publicly available) research is that it's not obvious: e.g. Claude 4 Opus with a minimal scaffold (just a bash tool) is not that bad at SWEBench (67% vs 74%). And METR's elicitation efforts were dwarfed by OpenAI's post-training.
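For concreteness, here is a minimal sketch of what "a minimal scaffold with just a bash tool" means: a bare loop that hands the task to the model, runs whatever shell command the model asks for, and feeds the output back. This is illustrative only, not METR's or Anthropic's actual harness; the query_model helper and the BASH:/DONE: message convention are assumptions standing in for whatever LLM API and tool-call format is actually used.

```python
# Bare "bash tool only" agent loop (illustrative sketch, not any lab's real harness).
# query_model stands in for an LLM API call; it is assumed to return either a shell
# command to run ("BASH: <cmd>") or a final answer ("DONE: <answer>").
import subprocess

def query_model(messages: list[dict]) -> str:
    raise NotImplementedError("replace with a real LLM API call")

def run_agent(task: str, max_steps: int = 30) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        if reply.startswith("BASH:"):
            cmd = reply[len("BASH:"):].strip()
            result = subprocess.run(
                cmd, shell=True, capture_output=True, text=True, timeout=300
            )
            # Feed exit code, stdout, and stderr back as the next observation.
            messages.append({
                "role": "user",
                "content": f"exit={result.returncode}\n{result.stdout}\n{result.stderr}",
            })
    return "gave up after max_steps"
```

The point of the 67% vs 74% comparison is that even a loop this bare recovers most of the SWEBench performance reported with heavier scaffolding.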
Because there is a reason why Cursor and Claude Code exist. I'd suggest looking at what they do for more details.
METR is not in the business of building code agents. Why is their work informing so much of your views on the usefulness of Cursor or Claude Code?
This is literally the point I make above:
Either you fail to capture the relevant capabilities and build unwarranted confidence that things are ok, or you are doing public competitive elicitation & amplification work.
I think it's noteworthy that Claude Code and Cursor don't advertise things like SWEBench scores, while LLM model releases do. I think whether commercial scaffolds make capability evals spookier depends a lot on what the endpoint is. If your endpoint is something like SWEBench, then I don't think the scaffold matters much.
Maybe your claim is that if the commercial scaffold builders tried to maximize SWEBench scores, better scaffolding would get them massively better SWEBench performance (e.g. a bigger jump than the Sonnet 3.5 to Sonnet 4 gap)? I doubt it, and I would guess most people familiar with scaffolding (including at Cursor) would agree with me.
In the case of the METR experiments, the endpoints are very close to SWEBench, so I don't think the scaffold matters. For the AI Village, I don't have a strong take, but I don't think it's obvious that using the best scaffold that will be available in 3 years would make a bigger difference than waiting 6 months for a better model.
Also note that METR did put a lot of effort into hill-climbing on their own benchmarks, and according to their eval they outcompeted many of the commercial scaffolds that existed at the time of their blog post (though some of that is because those commercial scaffolds were strictly worse than extremely basic scaffolds on the sort of evals METR was looking at). The fact that they got low returns on their labor, compared to “wait 6 months for labs to do better post-training”, is some evidence that they would not have gotten drastically different results with the best scaffold that will be available in 3 years.
(I am more sympathetic to the risk of post-training overhang—and in fact the rise of reasoning models was an update that there was post-training overhang. Unclear how much post-training overhang remains.)
Code agents (Cursor or Claude Code) are much better at performing code tasks than their fine-tuned equivalents, mainly because of the scaffolding.
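To make that claim concrete, what a product like Cursor or Claude Code layers on top of the bare loop sketched earlier is mostly a richer tool surface plus context management. The sketch below is hypothetical: the tool names and signatures are invented for illustration and do not correspond to either product's actual tool set.

```python
# Hypothetical sketch of the richer tool surface a code-agent scaffold exposes,
# compared to the bare bash loop above. Names and signatures are invented for
# illustration; Cursor and Claude Code each have their own, different tooling.
import subprocess
from pathlib import Path

def read_file(path: str, start: int = 0, end: int | None = None) -> str:
    # Return a slice of a file so the model never has to ingest a whole repo.
    lines = Path(path).read_text().splitlines()
    return "\n".join(lines[start:end])

def edit_file(path: str, old: str, new: str) -> None:
    # Targeted search-and-replace edit instead of rewriting whole files.
    p = Path(path)
    p.write_text(p.read_text().replace(old, new, 1))

def search_repo(pattern: str) -> str:
    # Delegate code search to grep so the model can locate relevant files cheaply.
    return subprocess.run(
        ["grep", "-rn", pattern, "."], capture_output=True, text=True
    ).stdout

def run_tests(selector: str | None = None) -> str:
    # Run the test suite and return the output as feedback for the next edit.
    cmd = ["python", "-m", "pytest"] + ([selector] if selector else [])
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {f.__name__: f for f in (read_file, edit_file, search_repo, run_tests)}
```

Whether this extra layer buys much on SWEBench-like endpoints, versus waiting for a better model, is exactly the disagreement in the replies above.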
When I told you that we should not put 4% of the global alignment spending budget into AI Village, you asked me whether I thought METR should also not get as much funding as it does.
It should now be more legible why.