AI Agent Benchmarks Are Broken

Link post

We find that many agentic benchmarks have issues with task setup or reward design, leading to under- or overestimation of agents' performance by up to 100%.

Figure: Results of applying ABC to ten widely used AI agent benchmarks.

To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues.

Figure: Operational and conceptual processes of AI agent evaluation. Task and outcome validity are essential to ensure that benchmark results truly reflect agents' capabilities.
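As a toy illustration of the reward-design failure mode (a sketch of our own, not code from the paper or from any of the audited benchmarks): a grader that merely searches an agent's transcript for the expected answer string will credit gamed or accidental matches, and in the hypothetical case below that alone inflates the measured success rate by 100% in relative terms.

```python
def lax_grader(transcript: str) -> bool:
    # Fragile reward check: pass if the expected answer appears anywhere in
    # the transcript, even inside reasoning text or a restated prompt.
    return "42" in transcript


def strict_grader(transcript: str) -> bool:
    # Sounder reward check: require the final non-empty line to be exactly
    # the expected answer.
    lines = [line.strip() for line in transcript.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "42"


# Hypothetical transcripts: one genuine success, one gamed match, one failure.
transcripts = [
    "Summing the series term by term gives:\n42",
    "The prompt mentions 42 entries, but I cannot solve this task.",
    "I was unable to complete the task.",
]

lax_rate = sum(map(lax_grader, transcripts)) / len(transcripts)
strict_rate = sum(map(strict_grader, transcripts)) / len(transcripts)

print(f"lax grader:    {lax_rate:.0%}")     # 67%
print(f"strict grader: {strict_rate:.0%}")  # 33%
# Relative overestimation: (2/3 - 1/3) / (1/3) = 100%
```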