METR Research Update: Algorithmic vs. Holistic Evaluation

TL;DR

  • On 18 real tasks from two large open-source repositories, early-2025 AI agents often produce functionally correct code that nonetheless cannot easily be used as-is, because of issues with test coverage, formatting/linting, or general code quality.

  • This suggests that the automatic scoring used by many benchmarks may overestimate AI agents' real-world performance. A sketch of the distinction between the two kinds of scoring follows below.

[Figure: bar chart comparing holistic and algorithmic scoring]
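
To make the distinction concrete, here is a minimal sketch, not METR's actual harness: the `pytest` and `ruff` commands are illustrative assumptions, and the holistic check is reduced to automated lint/format checks, whereas the real evaluation also involved human review of test coverage and code quality. It shows how pass/fail "algorithmic" scoring can accept a solution that a stricter, closer-to-mergeable check would reject.

```python
import subprocess


def algorithmic_score(repo_dir: str) -> bool:
    """Pass/fail scoring typical of automated benchmarks:
    the task counts as solved if the test suite passes."""
    tests = subprocess.run(["pytest", repo_dir], capture_output=True)
    return tests.returncode == 0


def holistic_score(repo_dir: str) -> bool:
    """Stricter scoring closer to 'usable as-is': tests must pass,
    and the change must also survive lint and format checks.
    (The human-review component of holistic scoring is omitted here.)"""
    if not algorithmic_score(repo_dir):
        return False
    lint = subprocess.run(["ruff", "check", repo_dir], capture_output=True)
    fmt = subprocess.run(
        ["ruff", "format", "--check", repo_dir], capture_output=True
    )
    return lint.returncode == 0 and fmt.returncode == 0
```

Under this kind of setup, a submission with passing tests but failing lint checks counts as a success for `algorithmic_score` and a failure for `holistic_score`, which is the gap the bar chart above illustrates.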