METR Research Update: Algorithmic vs. Holistic Evaluation
Link post
TL;DR
On 18 real tasks from two large open-source repositories, early-2025 AI agents often produce functionally correct code that nevertheless cannot easily be used as-is, because of issues with test coverage, formatting/linting, or general code quality.
This suggests that the automatic scoring used by many benchmarks may overestimate AI agents' real-world performance.