Interestingly, reasoning doesn’t seem to help Anthropic models on agentic software engineering tasks, but does help OpenAI models.
I use ‘ultrathink’ in Claude Code all the time and find that it makes a difference.
I do worry that METR’s evaluation suite will start being less meaningful and noisier for longer time horizons as the evaluation suite was built a while ago. We could instead look at 80% reliability time horizons if we have concerns about the harder/longer tasks.
I’m overall skeptical of overinterpreting/extrapolating the METR numbers. It is far too anchored on the capabilities of a single AI model, a lightweight scaffold, and a notion of ‘autonomous’ task completion of ‘human-hours’. I think this is a mental model for capabilities progress that will lead to erroneous predictions.
If you are trying to capture the absolute frontier of what is possible, you don’t only test a single-acting model in an empty codebase with limited internet access and scaffolding. I would personally be significantly less capable at agentic coding if I only used 1 model (like replicating subliminal learning in about 1 hour of work + 2 hours of waiting for fine-tunes on the day of the release) with limited access to resources. You are instead using a variety of AI models based on their pros and cons[1], with well-crafted codebases for agentic coding and giving them access to whatever they want on the internet as a reference (+ much more)[2]. METR does note this limitation, but I want to emphasize its importance and potential for misleading extrapolations if people only consider the headline charts without considering the nuance.
I use ‘ultrathink’ in Claude Code all the time and find that it makes a difference.
I’m overall skeptical of overinterpreting/extrapolating the METR numbers. It is far too anchored on the capabilities of a single AI model, a lightweight scaffold, and a notion of ‘autonomous’ task completion of ‘human-hours’. I think this is a mental model for capabilities progress that will lead to erroneous predictions.
If you are trying to capture the absolute frontier of what is possible, you don’t only test a single-acting model in an empty codebase with limited internet access and scaffolding. I would personally be significantly less capable at agentic coding if I only used 1 model (like replicating subliminal learning in about 1 hour of work + 2 hours of waiting for fine-tunes on the day of the release) with limited access to resources. You are instead using a variety of AI models based on their pros and cons[1], with well-crafted codebases for agentic coding and giving them access to whatever they want on the internet as a reference (+ much more)[2]. METR does note this limitation, but I want to emphasize its importance and potential for misleading extrapolations if people only consider the headline charts without considering the nuance.
Anthropic suggests multi-agent scaffolds are much better for research.
We note some of what that might look like here.