METR Research Update: Algorithmic vs. Holistic Evaluation
Link post
TL;DR
On 18 real tasks from two large open-source repositories, early-2025 AI agents often produce functionally correct code that nevertheless cannot easily be used as-is, because of issues with test coverage, formatting/linting, or general code quality.
This suggests that the automatic scoring used by many benchmarks may overestimate AI agents' real-world performance.