Personally I think 2030 is possible but aggressive, and my timeline estimate is more around 2035. Two years ago I would have said 2040 or a bit later, but capability gains relevant to my own field and several others I know reasonably well have shortened that, along with the increase in funding for further development.
The Claude/Pokemon thing is interesting, and the overall Pokemon-playing trend across Anthropic’s models is clearly positive. I can’t say I had any opinion at all about how far along an LLM would get at Pokemon before that result got publicized, so I’m curious if you did. What rate of progress on that benchmark would you expect in a short-timelines world? What would it tell you if there’s an LLM agent that can beat Pokemon in six months, or a year, or two years?
Self-driving vehicles are already more of a manufacturing and regulatory problem than a technical one. For example, as long as the NHTSA only lets each manufacturer deploy 2,500 self-driving vehicles a year in the US, broad adoption cannot happen, regardless of technical capabilities or willingness to invest and build.
I also don’t think task length is a perfect metric. But it’s a useful one: a lower bound on what’s needed to complete all human-complete intellectual tasks. As with every benchmark to date, there will likely be something else to look at once we saturate this one.
I agree novel insights (or more of them; I can’t say there haven’t been any) will be strong evidence. I don’t understand the reason for thinking this should already be observable. Very, very few humans ever produce anything like truly novel insights at the forefront of human knowledge. “They have not yet reached the top <0.1% of human ability in any active research field” is an incredibly high bar, one I wouldn’t expect to be passed until we’re already extremely close to AGI, and it should be telling that such a late bar is on the short list of signs you are looking for.

I would also add two other things. First, how many research labs do you think have actually tried to use AI to make novel discoveries, given how little calendar time there has been to figure out how to adopt and use the models we do have? If Gemini 2.5 could do this today, I don’t think we’d necessarily have any idea. And second, do you believe it was a mistake that two of the 2024 Nobel prizes went to AI researchers for work that advanced chemistry and physics?
AI usefulness is strongly field-dependent today. In my own field, it went from a useful supplementary tool to “This does 50-80% of what new hires did and 30-50% of what I used to do, and we’re scrambling to refactor workflows to take advantage of it.”
Hallucinations are annoying, but good prompting strategy, model selection, and task definition can easily get the rates down to the low single digits, and in many cases lower than those of a smart human given a similar amount of context. I can often literally just tell an LLM “Rewrite this prompt in such a way as to reduce the risk of hallucinations or errors, answer that prompt, then go back and check for and fix any mistakes,” and that’ll cut it down a good 50-90% depending on the topic and the question complexity. I can also ask the model to cite sources for factual claims, dump the sources back into the next prompt, and ask whether there are any factual claims not supported by the sources. It’s a little circular, but also a bit Socratic, and not really any worse than when I’ve tried to teach difficult mental skills to some bright human adults.
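For concreteness, here’s a minimal sketch of that rewrite-answer-check loop in Python. The `llm()` callable and the `answer_with_fewer_hallucinations` helper are hypothetical stand-ins for whatever chat-completion API you actually use; the point is the structure of the passes, not any particular SDK.

```python
def answer_with_fewer_hallucinations(llm, question: str) -> str:
    """Hypothetical sketch: llm(prompt) stands in for any chat-completion call."""

    # Pass 1: have the model rewrite the prompt to reduce ambiguity and error risk.
    rewritten = llm(
        "Rewrite this prompt in such a way as to reduce the risk of "
        f"hallucinations or errors:\n\n{question}"
    )

    # Pass 2: answer the rewritten prompt, asking for sources on factual claims.
    answer = llm(
        f"{rewritten}\n\nCite sources for every factual claim in your answer."
    )

    # Pass 3: dump the draft (with its cited sources) back in and ask the model
    # to flag and fix any claims the sources do not actually support.
    return llm(
        "Below is a draft answer with its cited sources. Check for factual "
        "claims not supported by those sources, fix any you find, and return "
        f"the corrected answer:\n\n{answer}"
    )
```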