Following last week’s RAND report indicating that LLMs have no effect on people’s ability to make bioweapons, OpenAI has published a study with a similar design and results.
Summary
To evaluate [biological threats posed by AI], we conducted a study with 100 human participants, comprising (a) 50 biology experts with PhDs and professional wet lab experience and (b) 50 student-level participants, with at least one university-level course in biology. Each group of participants was randomly assigned to either a control group, which only had access to the internet, or a treatment group, which had access to GPT-4 in addition to the internet. Each participant was then asked to complete a set of tasks covering aspects of the end-to-end process for biological threat creation.
To our knowledge, this is the largest to-date human evaluation of AI’s impact on biorisk information.
Findings. Our study assessed uplifts in performance for participants with access to GPT-4 across five metrics (accuracy, completeness, innovation, time taken, and self-rated difficulty) and five stages in the biological threat creation process (ideation, acquisition, magnification, formulation, and release). We found mild uplifts in accuracy and completeness for those with access to the language model. Specifically, on a 10-point scale measuring accuracy of responses, we observed a mean score increase of 0.88 for experts and 0.25 for students compared to the internet-only baseline, and similar uplifts for completeness (0.82 for experts and 0.41 for students). However, the obtained effect sizes were not large enough to be statistically significant, and our study highlighted the need for more research around what performance thresholds indicate a meaningful increase in risk. Moreover, we note that information access alone is insufficient to create a biological threat, and that this evaluation does not test for success in the physical construction of the threats.
Notes
They studied how helpful GPT-4 is to bio experts and students.
They gave a “research-only” version of GPT-4 without safeguards to experts, but not to students, citing information hazard concerns. GPT-4’s safeguards make the model less useful for creating bioweapons, so the study likely underestimates GPT-4’s capability to help students.
For both the experts and the students they had a sample size of N=50 (25 internet only; 25 internet and model). Given the small sample size, the study lacks the statistical power to pick up on small effects. Still, the study shows that models currently don’t have a strong effect on people’s ability to create bioweapons.
They mention additional limitations of the study, including:
time constraints (participants only had 5 hours to come up with a plan)
no tool usage for GPT-4
studying individuals rather than groups
focusing on the ability to create existing threats, rather than novel threats
One difficulty they highlight as important is setting a threshold for biorisk – how useful does a model need to be for it to count as a significant threat?
Edit: As @Chris_Leong points out, Gary Marcus claims that the statistical methods used by OpenAI were wrong and underestimate the effect of using an LLM. Methods like the t-test (and likely also ANCOVA) would show statistically significant results.
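To make the sample-size point from the notes more concrete, here is a rough sketch of how much power a two-arm comparison with 25 participants per arm has. It is not OpenAI’s analysis: it assumes a simple two-sided, two-sample comparison at alpha = 0.05, uses a normal approximation rather than an exact t-distribution, and expresses effects as Cohen’s d (difference in means divided by the standard deviation). The function names are made up for illustration, and since the summary above doesn’t give standard deviations, the reported uplifts (e.g. 0.88 points) can’t be converted to d here.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def approx_power(d, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation).

    d is the standardized effect size (Cohen's d); both arms are assumed
    to have n_per_arm participants and equal variance.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_arm / 2)  # expected value of the test statistic
    # Probability that the statistic lands in either rejection region.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def min_detectable_d(n_per_arm, alpha=0.05, power=0.8):
    """Smallest Cohen's d detectable with the requested power (approximate)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_crit + z_power) * sqrt(2 / n_per_arm)


if __name__ == "__main__":
    n = 25  # participants per arm, as in the study
    for d in (0.2, 0.5, 0.8):  # conventional small / medium / large effects
        print(f"d = {d}: power ~ {approx_power(d, n):.2f}")
    print(f"minimum detectable d at 80% power ~ {min_detectable_d(n):.2f}")
```

Under these assumptions, 25 per arm only reaches 80% power for effects of roughly d = 0.8, which is conventionally considered large, while a small effect of d = 0.2 would be detected only about 11% of the time.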
As far as I can tell, their results find that bio experts don’t perform notably better than students at creating bioweapons (given just access to the internet).
This seems somewhat surprising to me as I would have naively expected a large gap between experts and students.
This might indicate some issues with the tasks or the limited sample size. Or perhaps internet access and general research ability dominate over prior experience?
Minimally, it might be nice to verify that bioweapons experts perform significantly better than students on this evaluation, to confirm that there is some signal in the measurement. (I don’t see these results anywhere, though I haven’t read in detail.)
This is surprising – thanks for bringing this up!
The main threat model with LLMs is that they give amateurs expert-level biology knowledge. But this study indicates that expert-level knowledge isn’t actually helpful, which implies we don’t need to worry about LLMs giving people expert-level biology knowledge.
Some alternative interpretations:
The study doesn’t accurately measure the gap between experts and non-experts.
The knowledge needed to build a bioweapon is super niche. It’s not enough to be a biology PhD with wet lab experience; you have to specifically be an expert on gain of function research or something.
LLMs won’t help people build bioweapons by giving them access to special knowledge. Instead, they help them in other ways (e.g. by accelerating them or making them more competent as planners).
Gary Marcus has criticised the results here:
Also, there was likely an effect, but sample sizes were too small to detect this:
Thanks for pointing this out! I’ve added a note about it to the main post.