AI Agent Benchmarks Are Broken

Link post

We find that many agentic benchmarks have issues with task setup or reward design, leading to under- or overestimation of agents' performance by up to 100%.

Figure: Results of applying ABC to ten widely used AI agent benchmarks.

To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues.

Figure: Operational and conceptual processes of AI agent evaluation. Task and outcome validity are essential to ensure that benchmark results truly reflect agents' capabilities.
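As a toy illustration of the reward-design failure mode (a sketch of our own, not code from the paper or from any of the audited benchmarks): a grader that merely searches an agent's transcript for the expected answer string will credit gamed or accidental matches, and in the hypothetical case below that alone inflates the measured success rate by 100% in relative terms.

```python
def lax_grader(transcript: str) -> bool:
    # Fragile reward check: pass if the expected answer appears anywhere in
    # the transcript, even inside reasoning text or a restated prompt.
    return "42" in transcript


def strict_grader(transcript: str) -> bool:
    # Sounder reward check: require the final non-empty line to be exactly
    # the expected answer.
    lines = [line.strip() for line in transcript.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "42"


# Hypothetical transcripts: one genuine success, one gamed match, one failure.
transcripts = [
    "Summing the series term by term gives:\n42",
    "The prompt mentions 42 entries, but I cannot solve this task.",
    "I was unable to complete the task.",
]

lax_rate = sum(map(lax_grader, transcripts)) / len(transcripts)
strict_rate = sum(map(strict_grader, transcripts)) / len(transcripts)

print(f"lax grader:    {lax_rate:.0%}")     # 67%
print(f"strict grader: {strict_rate:.0%}")  # 33%
# Relative overestimation: (2/3 - 1/3) / (1/3) = 100%
```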