For example, SWE-bench-Verified uses insufficient test cases, while τ -bench counts empty responses as successful. Such issues can lead to underor overestimation of agents’ performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVEBench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.
We conducted an in-depth analysis of specific issues present in each agentic benchmark. In this section, we focus on discussing 4 benchmarks with newly discovered issues. We defer a detailed description of all identified issues in Appendix D and experiment designs to E.
1. τ-bench relies on trivial states or substrings as ground truth [...] overestimating performance by 38%. 2. τ-bench also allows agents to list every possible answer [...] overestimating performance by 40%. 3. WebArena [...] uses an LLM-as-a-Judge without validating its accuracy or consistency [...], leading to a 1.4–5.2% performance overestimate. 4. SWE-Lancer fails to fully isolate agents from the ground truth [...], allowing agents to score 100% without solving tasks. 5. KernelBench omits comprehensive fuzzing for edge cases and memory layouts [...] overestimating kernel-correctness performance by approximately 31%. 6. In OSWorld, the task website changes have broken the HTML selectors used for evaluation, leading to a 28% performance underestimation in the chrome task section.
Establishing Best Practices for Building Rigorous Agentic Benchmarks by Yuxuan Xhu et. al. July 14th, 2025 covers this problem for agentic benchmarks.