Update (20th Sep 2025): Scale AI has revised their Humanity’s Last Exam preprint in light of this evaluation, and conducted their own checks on the accuracy of HLE questions, finding an expert disagreement rate of approximately 18% instead:
We conducted another targeted peer review on a biology, chemistry, and health subset, as proposed by [47], and found an expert disagreement rate of approximately 18%. This level of expert disagreement is in line with what is observed in other challenging, expert-grade machine learning benchmarks and also observed in other similarly designed work; for example, [6] notes that disagreement among expert physicians is frequent on complex health topics
They also note:
To illustrate, if we were to adopt a single-reviewer methodology where a question is flagged based on just one dissenting expert, the disagreement rate on the aforementioned health-focused subset jumps from 18% to 25%, which is close to the setting described in [47].
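As a rough illustration of why the flagging rule matters (a toy model with assumed independent reviewers, not Scale AI’s actual methodology): if each reviewer independently disagrees with an answer with some probability, a “flag if any one reviewer dissents” rule flags more questions than the per-reviewer disagreement rate alone would suggest.

```python
# Toy model only: assumes independent reviewers; not Scale AI's actual method.
# Shows why a "flag on any single dissent" rule yields a higher flag rate than
# the per-reviewer disagreement rate itself.

def flag_rate_any_dissent(p_disagree: float, n_reviewers: int) -> float:
    """Probability a question is flagged when one dissenting expert suffices."""
    return 1 - (1 - p_disagree) ** n_reviewers

if __name__ == "__main__":
    p = 0.15  # hypothetical per-reviewer disagreement rate
    for k in (1, 2, 3):
        print(f"{k} reviewer(s), any-dissent rule: {flag_rate_any_dissent(p, k):.0%}")
```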
FutureHouse is a company that builds literature research agents. They tested their agents on the bio + chem subset of HLE questions and noticed errors in the questions themselves.
The post’s first paragraph:
Humanity’s Last Exam has become the most prominent eval representing PhD-level research. We found the questions puzzling and investigated with a team of experts in biology and chemistry to evaluate the answer-reasoning pairs in Humanity’s Last Exam. We found that 29 ± 3.7% (95% CI) of the text-only chemistry and biology questions had answers with directly conflicting evidence in peer reviewed literature. We believe this arose from the incentive used to build the benchmark. Based on human experts and our own research tools, we have created an HLE Bio/Chem Gold, a subset of AI and human validated questions.
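As a side note on the statistics: an interval like “29 ± 3.7% (95% CI)” is consistent with a standard normal-approximation interval for a binomial proportion. The sketch below is only an illustration; the counts are hypothetical, chosen so the arithmetic roughly reproduces the reported figures, and the post does not say exactly how FutureHouse computed theirs.

```python
# Minimal sketch: normal-approximation (Wald) 95% CI for a proportion.
# The counts are hypothetical, not FutureHouse's actual numbers.
import math

def proportion_ci(flagged: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return (proportion, half-width) of an approximate 95% confidence interval."""
    p = flagged / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, half_width

if __name__ == "__main__":
    p, hw = proportion_ci(flagged=166, total=572)  # hypothetical counts
    print(f"{p:.1%} ± {hw:.1%}")  # prints roughly 29.0% ± 3.7%
```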
About the initial review process for HLE questions:
[...] Reviewers were given explicit instructions: “Questions should ask for something precise and have an objectively correct, univocal answer.” The review process was challenging, and performed in two rounds. First reviewers rated questions for domain alignment and complexity, but notably reviewers were not required to verify the full accuracy of a question’s rationale if it would take “more than 5 minutes” (https://arxiv.org/abs/2501.14249). A second round of question selection was performed to find the best questions via having high-quality reviews. Finally, a bug-bounty crowd sourcing system was used for post-release refinement.
So question writers had to verify that frontier models get the questions wrong, but reviewers only had to spend about five minutes per question and weren’t required to verify that the answers and rationales were actually correct.
Some examples they show:
We believe this decision led to many “gotcha” adversarial-type questions that ultimately confused the question writers and evaluators to the point that the questions lost all meaning.
Here’s an example question: What was the rarest noble gas on Earth as a percentage of all terrestrial matter in 2002? Answer: Oganesson
First, we would argue this is not a “PhD level research” question, but instead trivia. [...] Only five atoms of oganesson have ever been made and no properties have been measured. Predictions show that it will be solid, not a gas, and some argue it is not noble because of how reactive it is (10.1002/anie.202011976). [...] There are also multiple peer-reviewed papers published (including in 2002) that list the terrestrial fractions of noble gases and oganesson is not considered as part of terrestrial matter (10.2138/rmg.2002.47.1).
[...]
HLE Question 1: What is the BUD for a single dose container ampule from the time of puncture in a sterile environment?
HLE Answer: 1 hour
HLE Rationale: The Beyond-Use Date (BUD) for a single-dose ampule is determined by the United States Pharmacopeia (USP). [...] It is indeed “best practice” to use an SDC ampule immediately, however in the USP <797> it states the BUD for an SDC ampule is 1 hour from the time of puncture or opening. This distinction in time is important as “immediately” is not the same as “1 hour since puncture.” [...]
For Question 1, we were unable to validate the rationale, and in our reading of USP <797> it appears that the 1-hour limit in sterile environments only applies to single dose vials, but not ampules. Our reading, as corroborated by an independent expert, is that the correct answer is to use the ampule immediately. One source of confusion that we see deep research style tools make is from slide 45 on this powerpoint (https://www.pharmacy.ca.gov/meetings/agendas/2019/19_mar_cmpd_mat.pdf), which can mislead to believing the 1-hour time limit. This may have led to the original HLE incorrect answer and is often how deep research style tools recapitulate the HLE answer.
HLE Question 2: Which of the following have Raphidiopterans been recorded feeding on as adults?
Answer Choices:
A. Nectar
B. Māhoe pollen
C. Fungus
D. Karamū leaf tissue
E. Totara Aphids
F. A and E
G. D and E
HLE Answer: A
HLE Rationale: Raphidiopterans (snakeflies) are predatory insects, but several papers have recorded them feeding on nectar. The common names given in B, D and E are endemic to New Zealand. No snakefly has ever been found in New Zealand, so there will be no records of them feeding on those species.
We were unable to find good sources for Question 2’s claim of snakeflies feeding on nectar. Several sources remark that other Orders of the Neuropterida Super Order have been recorded eating nectar, but not specifically Raphidiopteran. For example, Jepson and Head examined the stomachs of deceased Neuropterida and consistently found predation of aphids and occasionally pollen, but no nectar (10.11646/zootaxa.4098.1.5). Aldrich and Zhang wrote a review on the ecology of Neuroptera and pointed out some eat nectar (Chrysopa), but not Neuropterida (nor Raphidiopterans) (10.1146/annurev-ento-010715-023507). While the cited work mentioned in the rationale may exist, we were unable to find any specific examples for either the Inocelliidae or Raphidiidae families eating nectar.
Maybe someone saw a Raphidiopteran eat nectar once, which is extremely out of character, and recorded it somewhere in a way that makes keyword search impossible. Certainly, that answer is not a “univocal” answer, and a domain expert doing research would find the answer “nectar” to be widely contradicted by the sources mentioned above.
I initially noticed this post from spiderduckpig in the Manifold Discord server.
I think it would be cool if more people did similar reviews of other benchmarks. Compared to creating new benchmarks, this is a relatively low-effort, high-value thing to do.
It would significantly increase our capacity to properly measure AI capabilities, and would also improve methodological standards for creating new benchmarks.
benchmarks are how people optimize for capabilities. working on them has been a core method of increasing capabilities over the past 10+ years of machine learning, at minimum, since I started paying attention in 2015. if you want to increase endgame (asi) alignment outcomes, measuring capabilities is a distraction at best.
edit: whether this applies to bad behavior benchmarks is contested below. I happily restrict my claim to benchmarks that are measuring things that labs would have some reason to optimize for and still believe that covers this topic.
...🤔 …well played, Dan, well played.
(I don’t think Dan Hendrycks deliberately released a benchmark with errors, but it would be hilarious, especially given his gripes about GPQA label noise.)
I also very much don’t think the errors were on purpose. Abstaining from capability measurement doesn’t seem likely to do much besides save one’s own time to do something more useful; I just hope people who are excited to do alignment work don’t end up wasting their time on it. Errors seem unlikely to particularly slow things down: someone else is going to make a benchmark anyway, so let a capabilities person do it and spend your time doing something more useful, like figuring out how to algorithmically generate arbitrary amounts of guaranteed-aligned training data from interacting with humans, with some sort of probabilistic guarantee that it represents behavior that is in fact good rather than simply easy to generate. (Edit: which is probably done through a somewhat complex theoretical breakthrough about how to incentivize aligned-by-construction exploration or some such thing.)
While I definitely agree there are negative externalities here, I also think there are extremely positive externalities from key decision makers being better informed, especially knowing how close we are to certain capabilities like AI-enabled bioterrorism or cybercrime, or automated R&D/an intelligence explosion, etc. Information is great, and I think it generally has a fairly positive effect even if the decision maker is not highly competent. Bioterrorism and cybercrime, at least, are not things I’m concerned about AGI researchers hill-climbing on; automated R&D is much dicier.
Private benchmarks seem solid here too.
I’m surprised to hear that you aren’t concerned about negative benchmarks being hillclimb targets for anyone. This updates me somewhat, though the hypotheses I’m still worried about are ones where the dishonest labs, whichever those turn out to be, are the main source of optimizing-for-bad-behavior-benchmarks. I also expect that bio/chem tasks that aren’t malicious-use-specific, which is the topic at hand, will get optimized for by less-dishonest labs, in at least some cases.
Yeah, I feel much better about malicious-use-specific ones. Agreed that HLE is more generic, and this is much worse.
Consider these possibilities for what benchmarks are doing here.
1. Training AIs on quantum physics questions directly makes the AI smarter.
2. Training on quantum physics makes the AI memorize quantum trivia, but doesn’t make it smarter in some deep sense.
3. Like 2, except that the existence of the benchmark makes grad student descent more effective: people learn which AI algorithms work best via human trial and error.
4. Like 3, except that public benchmarks just get memorized, making them useless for grad student descent.
5. Subtle psy-op meant to make the AGI stupider, of course.
Establishing Best Practices for Building Rigorous Agentic Benchmarks by Yuxuan Zhu et al. (July 14th, 2025) covers this problem for agentic benchmarks.