Brandon Sayler comments on About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong

Brandon Sayler 4 Aug 2025 17:09 UTC
2 points
0
Establishing Best Practices for Building Rigorous Agentic Benchmarks by Yuxuan Xhu et. al. July 14th, 2025 covers this problem for agentic benchmarks.
For example, SWE-bench-Verified uses insufficient test cases, while τ -bench counts empty responses as successful. Such issues can lead to underor overestimation of agents’ performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVEBench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.
We conducted an in-depth analysis of specific issues present in each agentic benchmark. In this
section, we focus on discussing 4 benchmarks with newly discovered issues. We defer a detailed
description of all identified issues in Appendix D and experiment designs to E.

1. $τ$ -bench relies on trivial states or substrings as ground truth [...] overestimating performance by 38%.
2. $τ$ -bench also allows agents to list every possible answer [...] overestimating
performance by 40%.
3. WebArena [...] uses an LLM-as-a-Judge without validating its accuracy or consistency [...], leading to a 1.4–5.2% performance overestimate.
4. SWE-Lancer fails to fully isolate agents from the ground truth [...], allowing agents to score 100% without solving tasks.
5. KernelBench omits comprehensive fuzzing for edge cases and memory layouts [...] overestimating kernel-correctness performance by approximately 31%.
6. In OSWorld, the task website changes have broken the HTML selectors used for evaluation,
leading to a 28% performance underestimation in the chrome task section.

Brandon Sayler comments on About 30% of Humanity’s Last Exam chemistry/​biology answers are likely wrong

Brandon Sayler comments on About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong