Thoughts on CBRN/bio results for the new OpenAI open-weight model (GPT-oss-120b).
My view is that releasing this model is net positive in terms of fatalities averted over the longer term, but the release kills a substantial number of people in expectation due to bioterror (maybe ~2000 in absolute terms counterfactually; more if OpenAI were the only actor, or if other actors were always at least as safe as OpenAI). See this earlier post for my general thoughts on the situation with open-weight models around the current level of capability.
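(To make the counterfactual framing concrete, here is a minimal sketch; every function and parameter below is hypothetical, and none of the numbers come from this post or from OpenAI. The point is just that the marginal harm attributed to a release is the difference between the "released" and "not released" worlds, and the "not released" world already contains other capable open-weight models.)

```python
# Hypothetical sketch of the counterfactual calculation; all parameters are
# placeholders, not estimates from this post or from OpenAI.

def expected_bioterror_fatalities(p_attempt: float,
                                  p_success: float,
                                  fatalities_per_success: float) -> float:
    """Expected fatalities under one oversimplified attack model."""
    return p_attempt * p_success * fatalities_per_success


def counterfactual_impact_of_release(p_attempt: float,
                                     p_success_with_release: float,
                                     p_success_without_release: float,
                                     fatalities_per_success: float) -> float:
    """Marginal expected fatalities attributable to this release: the
    difference between the world with the release and the world without it."""
    with_release = expected_bioterror_fatalities(
        p_attempt, p_success_with_release, fatalities_per_success)
    without_release = expected_bioterror_fatalities(
        p_attempt, p_success_without_release, fatalities_per_success)
    return with_release - without_release

# If other actors release similarly capable (or less safe) open-weight models
# anyway, p_success_without_release is nearly as high as p_success_with_release
# and the counterfactual impact is small; if OpenAI were the only actor, or all
# other actors were at least as safe, the gap (and the attributable harm) grows.
```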
I think it’s plausible but unlikely (perhaps 25% likely?) that in retrospect the model is well described as meeting the CBRN high threshold from OpenAI’s preparedness framework. I’d guess 35% likely for the same question for o3.
Overall, the model seems pretty close to as capable as o4-mini or o3, but OpenAI claims they did (more?) dataset filtering. It’s unclear whether this dataset filtering did anything: the model doesn’t appear to have substantially degraded performance on any bio task, including ones where I would have hoped filtering matters (in particular, I’d hope you’d filter virology text in a way that greatly lowers VCT scores, but this doesn’t appear to be the case).
So, one big source of uncertainty is how well the data filtering worked. Based on my understanding of the situation, I’d guess it does very little (also, I think they say they did the same filtering for GPT-4o?), but it seems reasonably likely (40%?) that it has a big effect.
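(For concreteness, here is a minimal sketch of what dataset filtering of this kind could look like, assuming a keyword pre-filter plus a trained classifier; OpenAI hasn’t described their actual pipeline, so the structure, names, and threshold below are all hypothetical.)

```python
# Hypothetical sketch of a pre-training filter for dual-use virology content.
# This is NOT OpenAI's pipeline; the keyword screen, classifier interface,
# and threshold are illustrative placeholders.
from typing import Callable, Iterable, List, Tuple

SENSITIVE_TERMS = {"reverse genetics", "viral rescue", "serial passage"}  # placeholders


def keyword_flag(document: str) -> bool:
    """Cheap first-pass screen for obviously sensitive virology text."""
    text = document.lower()
    return any(term in text for term in SENSITIVE_TERMS)


def filter_corpus(documents: Iterable[str],
                  classifier_score: Callable[[str], float],
                  threshold: float = 0.5) -> Tuple[List[str], int]:
    """Drop documents flagged by the keyword screen or by a (hypothetical)
    classifier scoring how much dual-use detail a document contains."""
    kept, removed = [], 0
    for doc in documents:
        if keyword_flag(doc) or classifier_score(doc) >= threshold:
            removed += 1
        else:
            kept.append(doc)
    return kept, removed
```

The open question in the paragraph above is whether a pass like this removes enough of the relevant text to actually move benchmarks like VCT.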
My overall view is that the release is net good, and I tentatively support OpenAI releasing these open-weight models (and generally behaving in this way), but I’m reluctant to endorse it (especially strongly) because of the cost in lives.
Redwood in particular will probably find this release helpful for our research and I expect other organizations in a similar position will also be able to do better research because of this release.
The external safety expert feedback (see section 5.1.1 in the model card) seems good and I’m glad OpenAI did this. Their selection of experts seems reasonable and doesn’t feel cherry-picked to me. Getting expert feedback and expert opinions on the overall interpretation seems like a good thing to do going forward (as models get more capable and especially as evaluations get more uncertain and subjective).
That said, it’s unclear what the independent experts thought about the overall interpretation of the results: my sense is that, after their recommendations were adopted, these experts were probably mostly happy with the quality of the evaluations (at least putting aside much higher-effort elicitation), but it’s unclear to me that these evaluations rule out CBRN high, or rule out substantial CBRN fatalities caused by models of this capability level. (These evaluations do pretty strongly indicate that the model isn’t much better than other open-weight models from a bio uplift perspective, but this is only one of the relevant questions; the overall level of risk from models this capable is the more relevant question for avoiding a race to the bottom.)
I overall feel reasonably happy with the evaluations (given my limited understanding) and the direct presentation of the evaluations (in terms of the literal graphs), but my view is that it seems plausible that models of this capability level meet the criteria for CBRN high and I don’t buy that OpenAI’s evals rule this out. I’m also uncertain about the quality of the elicitation, but OpenAI seemingly did a reasonable job given only low/moderate effort elicitation (as in, without doing a bunch of bio elicitation R&D).
The model seems very, very benchmaxxed. Third-party testing on unconventional or private benchmarks places even the largest gpt-oss below o4-mini and below the largest Qwen releases, and in a few cases even below the newer ~30B Qwens. It isn’t super capable to begin with, and the frankly absurd rate at which this model hallucinates kills what little use it might have for tool use. I think this model poses next to zero risk because it just isn’t very capable.
Seems plausible; I was just looking at benchmarks/evals at the time, and it could be that it’s much less capable than it appears.
(Worth noting that I think (many? most? all?) other competitive open-source models are also at least somewhat benchmark-maxed, so this alters the comparison a bit, but doesn’t alter the question of absolute capability.)
All existing CBRN evals have quite limited power to predict real bio risk from LLMs, since creating and/or growing a bioweapon requires hands-on work in a lab with actual cells and viruses. As someone with wet-lab experience, I can say that this requires a very different set of skills from the ones existing bio benchmarks measure, in part because these skills are often very hard to write down and measure. It’s often about knowing how to hold the pipette correctly so you don’t contaminate a sample, or how much force to apply while crushing cells. Benchmarks like VCT only measure a subset of the necessary skills. (And the smart people at SecureBio are absolutely aware of all this.)
AFAIK OpenAI partnered with Los Alamos National Laboratory to directly test how much an LLM helps with wet-lab work, by giving novices access to a lab and an LLM and observing how well they do. I would be excited to see the results of this experiment.
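(For illustration, the headline result of a study like this can be summarized as the difference in protocol-success rates between the LLM-assisted group and a control group; the sketch below is not the Los Alamos protocol and uses no real data.)

```python
# Illustrative analysis of a hypothetical uplift study: how much more often do
# novices complete a wet-lab protocol with LLM assistance than without it?
# All inputs are placeholders; no real results are used here.
import math
from typing import Tuple


def success_rate(successes: int, n: int) -> float:
    return successes / n


def uplift_with_ci(successes_llm: int, n_llm: int,
                   successes_ctrl: int, n_ctrl: int,
                   z: float = 1.96) -> Tuple[float, Tuple[float, float]]:
    """Absolute uplift (LLM group minus control) with a rough
    normal-approximation 95% confidence interval."""
    p1 = success_rate(successes_llm, n_llm)
    p2 = success_rate(successes_ctrl, n_ctrl)
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n_llm + p2 * (1 - p2) / n_ctrl)
    return diff, (diff - z * se, diff + z * se)
```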
So to wrap up, I don’t believe that we truly know how much biorisk these models pose based on the existing evals.
Image generation doesn’t seem to be able to come anywhere near ‘pipette off this layer, not that layer’ for bio/chem experiments.
It’s exciting to see OpenAI acknowledge that pre-training data filtering is part of their safety stack. When it comes to advanced technical content, minimizing the model’s exposure to sensitive material seems pretty intuitive. However, it is difficult to draw any strong conclusions about the effectiveness of the data filtering from this work, given the (understandably) limited details. They do not indicate the effort invested, the volume of data removed, or the sophistication of their filtering pipeline. I expect a company could share far more details about this process without divulging trade secrets.
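(As one hypothetical example of the kind of detail that seems shareable, a filtering disclosure could report figures like the following; the fields are my invention, and none of these numbers have actually been published.)

```python
# Hypothetical outline of a data-filtering disclosure; every field is invented
# for illustration, and none of these figures have been published by OpenAI.
filtering_disclosure = {
    "domains_targeted": ["virology", "other dual-use bio content"],
    "screening_method": "e.g. keyword pre-filter plus trained classifier",
    "fraction_of_corpus_removed": None,    # not disclosed
    "tokens_removed": None,                # not disclosed
    "classifier_precision_recall": None,   # not disclosed
    "eval_delta_vs_unfiltered_run": None,  # e.g. change in VCT score
}
```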
Was it public knowledge that they did data filtering for GPT-4o? I’ve been studying this space and was not aware of this. It’s also interesting that they’re using the “same” filtering pipeline a year later.