OpenAI report also finds no effect of current LLMs on viability of bioterrorism attacks

Link post

Following last week’s RAND report, which found no effect of LLMs on people’s ability to plan bioweapon attacks, OpenAI has published a study with a similar design and similar results.

Summary

To evaluate [biological threats posed by AI], we conducted a study with 100 human participants, comprising (a) 50 biology experts with PhDs and professional wet lab experience and (b) 50 student-level participants, with at least one university-level course in biology. Each group of participants was randomly assigned to either a control group, which only had access to the internet, or a treatment group, which had access to GPT-4 in addition to the internet. Each participant was then asked to complete a set of tasks covering aspects of the end-to-end process for biological threat creation.

To our knowledge, this is the largest to-date human evaluation of AI’s impact on biorisk information.

Findings. Our study assessed uplifts in performance for participants with access to GPT-4 across five metrics (accuracy, completeness, innovation, time taken, and self-rated difficulty) and five stages in the biological threat creation process (ideation, acquisition, magnification, formulation, and release). We found mild uplifts in accuracy and completeness for those with access to the language model. Specifically, on a 10-point scale measuring accuracy of responses, we observed a mean score increase of 0.88 for experts and 0.25 for students compared to the internet-only baseline, and similar uplifts for completeness (0.82 for experts and 0.41 for students). However, the obtained effect sizes were not large enough to be statistically significant, and our study highlighted the need for more research around what performance thresholds indicate a meaningful increase in risk. Moreover, we note that information access alone is insufficient to create a biological threat, and that this evaluation does not test for success in the physical construction of the threats.

Notes

  • They studied how helpful GPT-4 is to bio experts and students.

  • They gave a “research-only” version of GPT-4 without safeguards to the experts, but not to the students, citing information hazard concerns. GPT-4’s safeguards make the model less useful for creating bioweapons, so the study likely underestimates how much an unrestricted model could help students.

  • For both the experts and the students, the sample size was N=50 (25 internet only; 25 internet plus model). With samples that small, the study is only powered to detect fairly large effects (see the power sketch after this list). Still, it suggests that current models don’t have a strong effect on people’s ability to create bioweapons.

  • They mention additional limitations of the study, including:

    • time constraints (participants only had 5 hours to come up with a plan)

    • no tool usage for GPT-4

    • studying individuals rather than groups

    • focusing on the ability to create existing threats, rather than novel threats

  • One difficulty they highlight as important is setting a threshold for biorisk – how much uplift does a model need to provide before it counts as a meaningful increase in risk?
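On the power point above: a quick back-of-the-envelope calculation (mine, not from the report) for a two-sample comparison with 25 participants per arm suggests the study could only reliably detect quite large effects. The sketch below assumes a standard two-sided t-test at α = 0.05 with 80% power.

```python
# Rough power check for this design: two independent arms of 25 participants
# (internet only vs. internet + GPT-4) within each expertise group.
# This is a back-of-the-envelope calculation, not a figure from the report.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
min_detectable_d = analysis.solve_power(
    effect_size=None,  # solve for the minimum detectable effect size
    nobs1=25,          # participants per arm
    ratio=1.0,         # equal arm sizes
    alpha=0.05,
    power=0.8,
    alternative="two-sided",
)
print(f"Minimum detectable Cohen's d: {min_detectable_d:.2f}")
```

With 25 participants per arm this comes out to roughly d ≈ 0.8 – a large effect by conventional standards – which is why modest uplifts like those reported would struggle to reach significance in a study of this size.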

Edit: As @Chris_Leong points out, Gary Marcus argues that the statistical methods OpenAI used were flawed and underestimate the effect of using an LLM; according to him, methods like the t-test (and likely also ANCOVA) would show statistically significant results.
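For readers who want to see what those two analyses look like in practice, here is a minimal sketch using scipy and statsmodels. The per-participant scores are not public, so the data below are synthetic and purely illustrative; they say nothing about what the real scores would show, and the “expert” covariate is a hypothetical stand-in for whatever baseline variables an ANCOVA would actually adjust for.

```python
# Illustrative only: synthetic scores, not the study's data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
n = 25
df = pd.DataFrame({
    "gpt4": [0] * n + [1] * n,            # 0 = internet only, 1 = internet + GPT-4
    "score": np.concatenate([
        rng.normal(5.0, 2.0, n),          # made-up control scores (10-point scale)
        rng.normal(5.9, 2.0, n),          # made-up treatment scores
    ]),
    "expert": rng.integers(0, 2, 2 * n),  # hypothetical covariate (expert vs. student)
})

# Plain Welch's t-test on the raw scores
t, p = stats.ttest_ind(df.loc[df.gpt4 == 1, "score"],
                       df.loc[df.gpt4 == 0, "score"],
                       equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")

# ANCOVA-style model: adjust for a covariate via OLS regression
model = smf.ols("score ~ gpt4 + expert", data=df).fit()
print(model.params, model.pvalues, sep="\n")
```

The substantive disagreement is not about how to run these tests but about which analysis (and which multiple-comparison correction) is appropriate for the study’s design.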