Thanks, Ryan. Though it’s unclear how powerful a model needs to be to be dangerous (which is why I’m working on evals to measure this). In my opinion, Llama 2 and Mixtral are potentially already quite dangerous given the right fine-tuning regime.
So, if I’m correct, we’re already at the point of ‘too late, the models have been released. The problem will only get worse as more powerful models are released. Only the effort of processing the raw data into a training set and running the fine-tuning is saving the world from having a seriously dangerous bioweapon-designer-tuned LLM in bad actors’ hands.’
Of course, that’s just like… my well-informed opinion, man. I deliberately created such a bioweapon-designer-tuned LLM from Llama 70B as part of my red-teaming work on biorisk evals. It spits out much scarier information than a Google search supplies. Much. I realize there’s a lot of skepticism around this claim, and not much can be done to resolve it until better objective evaluations of the riskiness of bioweapon-design capability are developed. For now, I can only say that this is my opinion from personal observations.
I see a sense in which GPT-4 is completely useless for serious programming in the hands of a non-programmer who wouldn’t be capable/inclined to become a programmer without LLMs, even as it’s somewhat useful for programming (especially with unfamiliar but popular libraries/tools). So the way in which a chatbot helps needs qualification.
One possible measure is how much a chatbot increases the fraction of some demographic that’s capable of some achievement within some amount of time. All these “changes the difficulty by 4x” or “by 1.25x” need to mean something specific, otherwise there is hopeless motte-and-bailey that allows credible reframing of any data as fearmongering. That is, even when it’s only intuitive guesses, the intuitive guesses should be about a particular meaningful thing rather than level of scariness. Something prediction-marketable.
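To make that concrete, here is a minimal sketch (with entirely made-up, illustrative numbers, not data from any real study) of how "fraction of some demographic capable of some achievement within some amount of time" turns an "uplift" claim into an explicit ratio of two measured fractions:

```python
# Hypothetical sketch: operationalizing "fraction of a demographic capable of
# an achievement within a time limit" as a concrete uplift ratio.
# All numbers below are illustrative placeholders, not real study data.

def success_fraction(finish_times, time_limit):
    """Fraction of participants who completed the task within the limit."""
    return sum(t <= time_limit for t in finish_times) / len(finish_times)

# Illustrative completion times (hours) for two randomized groups.
control_times = [30, 45, 50, 80, 120, 200, 300]   # no chatbot
chatbot_times = [20, 25, 40, 55, 70, 150, 280]    # with chatbot

limit = 60  # hours allowed for the "achievement"

p_control = success_fraction(control_times, limit)
p_chatbot = success_fraction(chatbot_times, limit)

# The "uplift" is then a ratio of two measured fractions, not a vibe.
uplift = p_chatbot / p_control
print(f"control: {p_control:.2f}, chatbot: {p_chatbot:.2f}, uplift: {uplift:.2f}x")
```

A statement like "the chatbot gives a 1.33x uplift on task X for demographic Y within Z hours" is the kind of claim a prediction market could actually settle.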
Yes, I quite agree. Do you have suggestions for what a credible objective eval might consist of? What sort of test would seem convincing to you, if administered by a neutral party?
Here’s my guess (which is maybe the obvious thing to do).
Take bio undergrads, have them do synthetic biology research projects (ideally involving many of the skills that seem required for bioweapons), and randomize them into two groups, where one is allowed to use LLMs (e.g. GPT-4) and one isn’t. The projects should run for a reasonable duration (at least a week, ideally more than 4 weeks). Also, for both groups, provide high-level research advice/training on how to use the research tools they are given (in the LLM case, advice on how best to use LLMs).
Then, have experts in the field assess the quality of projects.
For a weaker preliminary experiment, you could run 2–4 hour sessions of some quick synth bio lab experiment with the same approximate setup (though the shortened duration introduces complications).
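The design above can be sketched end-to-end. This is a hypothetical illustration (participant lists, expert scores, and group sizes are all invented placeholders): random assignment to arms, then a simple permutation test on the expert-assessed project scores.

```python
# Hypothetical sketch of the randomized design: assign participants to an
# LLM arm and a control arm, then compare expert-assessed project scores
# with a simple permutation test. All scores below are invented placeholders.
import random

def randomize(participants, seed=0):
    """Randomly split participants into an LLM arm and a control arm."""
    rng = random.Random(seed)
    shuffled = participants[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]  # (llm_arm, control_arm)

def mean(xs):
    return sum(xs) / len(xs)

def permutation_test(scores_a, scores_b, n_iter=10_000, seed=0):
    """One-sided p-value for the hypothesis mean(scores_a) > mean(scores_b)."""
    rng = random.Random(seed)
    observed = mean(scores_a) - mean(scores_b)
    pooled = scores_a + scores_b
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(scores_a)]) - mean(pooled[len(scores_a):])
        if diff >= observed:
            hits += 1
    return hits / n_iter

# Illustrative expert ratings (1-10) of project quality, one per participant.
llm_scores = [7, 8, 6, 9, 7, 8]
control_scores = [5, 6, 7, 5, 6, 4]

p_value = permutation_test(llm_scores, control_scores)
print(f"mean uplift: {mean(llm_scores) - mean(control_scores):.2f}, p = {p_value:.3f}")
```

The permutation test avoids normality assumptions, which matters here since expert ratings on small groups are unlikely to be normally distributed; the main real-world difficulty is recruiting enough participants for the test to have any power.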
In my opinion, Llama 2 and Mixtral are potentially already quite dangerous given the right fine-tuning regime.
Indeed, I think it seems pretty unlikely that these models (finetuned effectively using current methods) change the difficulty of making a bioweapon by more than a factor of 4x. (Though perhaps you think something like “these models (finetuned effectively) make it maybe 25% easier to make a bioweapon and that’s pretty scary”.)
Yes, I’m unsure about the multiplication factors on “likelihood of bad actor even trying to make a bioweapon” and on “likelihood of succeeding given the attempt”. I think probably both are closer to 4x than 1.25x. But I think it’s understandable that this claim on my part seems implausible. Hopefully at some point I’ll have a more objective measure available.
I was trying to say “cost in time/money goes down by that factor for some group”.