All existing CBRN evals have fairly limited power to predict real biorisk from LLMs, since creating or growing a bioweapon requires hands-on lab work with actual cells and viruses. As someone with wet-lab experience, I can tell you that this requires a very different set of skills from the ones existing bio benchmarks measure, in part because these skills are often very hard to write down and measure. It’s often about knowing how to hold a pipette correctly so you don’t contaminate a sample, or how much force to apply while crushing cells. Benchmarks like VCT measure only a subset of the necessary skills. (And the smart people at SecureBio are absolutely aware of all this.)
AFAIK, OpenAI partnered with Los Alamos National Laboratory to test directly how much LLMs help with wet-lab work, by giving novices access to a lab and an LLM and observing how well they do. I would be excited to see the results of this experiment.
So, to wrap up: I don’t believe we truly know, based on the existing evals, how much biorisk these models pose.
Image generation doesn’t seem to be able to come anywhere near ‘pipette off this layer, not this layer’ for bio/chem experiments.