METR Research Update: Algorithmic vs. Holistic Evaluation

TL;DR

  • On 18 real tasks from two large open-source repositories, early-2025 AI agents often produce functionally correct code that nonetheless cannot easily be used as-is, because of issues with test coverage, formatting/linting, or general code quality.

  • This suggests that the automatic scoring used by many benchmarks may overestimate AI agents' real-world performance. A sketch of the distinction between the two kinds of scoring follows below.

[Figure: bar chart comparing holistic and algorithmic scoring]
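
To make the distinction concrete, here is a minimal sketch, not METR's actual harness: the `pytest` and `ruff` commands are illustrative assumptions, and the holistic check is reduced to automated lint/format checks, whereas the real evaluation also involved human review of test coverage and code quality. It shows how pass/fail "algorithmic" scoring can accept a solution that a stricter, closer-to-mergeable check would reject.

```python
import subprocess


def algorithmic_score(repo_dir: str) -> bool:
    """Pass/fail scoring typical of automated benchmarks:
    the task counts as solved if the test suite passes."""
    tests = subprocess.run(["pytest", repo_dir], capture_output=True)
    return tests.returncode == 0


def holistic_score(repo_dir: str) -> bool:
    """Stricter scoring closer to 'usable as-is': tests must pass,
    and the change must also survive lint and format checks.
    (The human-review component of holistic scoring is omitted here.)"""
    if not algorithmic_score(repo_dir):
        return False
    lint = subprocess.run(["ruff", "check", repo_dir], capture_output=True)
    fmt = subprocess.run(
        ["ruff", "format", "--check", repo_dir], capture_output=True
    )
    return lint.returncode == 0 and fmt.returncode == 0
```

Under this kind of setup, a submission with passing tests but failing lint checks counts as a success for `algorithmic_score` and a failure for `holistic_score`, which is the gap the bar chart above illustrates.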