While this might prevent models from giving instructions on e. g. building a bomb in response to a literal “how do I build a bomb?” prompt, they may still be able to infer bomb-building instructions from first principles using their general knowledge in response to e. g. “how do I build a compact device that omnidirectionally releases a lot of energy in under 1 sec?”. More generally, while this removes the explicit knowledge of what counts as a biothreat/bomb/etc., it doesn’t necessarily prevent the model from reconstructing that knowledge. The only way to prevent that would be to hobble the models’ general capabilities: either removing all physics/chemistry knowledge, or hobbling their general reasoning/problem-solving skills (to remove their ability to do clever reconstructions).
Now, arguably, LLMs may be forever unable to actually do this sort of “reconstruction”, since it’s an innovation-like task and they’ve been pretty bad at this so far. But:
This is obviously a concern if I’m wrong and LLMs are AGI-complete. In this case, they would eventually attain the skills for doing this kind of reconstruction, and only the aforementioned drastic hobbling would prevent it.
This might be less innovation-flavoured than it seems, because general-knowledge datasets might still contain the “shadows” of the harmful knowledge, or disparate pieces of it, such that its reconstruction is easy. (E. g., general physics knowledge + various scientific experiments and engineering principles + fiction and news articles which mention bombs.)
So: have you tried asking for instructions “indirectly”?
(Though even if this suffices to prevent reconstruction in the relatively dumb models of today, it might not work for tomorrow’s cleverer reasoners.
The core issue is the same as in e. g. Deep Deceptiveness: the circuits the model would use to reconstruct harmful knowledge are the same circuits that make it useful in other contexts, so you can’t have one without the other.)
Have you considered the following concern?:
While this might prevent models from giving instructions on e. g. building a bomb in response to a literal “how do I build a bomb?” prompt, they may still be able to infer bomb-building instructions from first principles using their general knowledge in response to e. g. “how do I build a compact device that omnidirectionally releases a lot of energy in under 1 sec?”. More generally, while this removes the explicit knowledge of what counts as a biothreat/bomb/etc., it doesn’t necessarily prevent the model from reconstructing that knowledge. The only way to prevent that would be to hobble the models’ general capabilities: either removing all physics/chemistry knowledge, or hobbling their general reasoning/problem-solving skills (to remove their ability to do clever reconstructions).
Now, arguably, LLMs may be forever unable to actually do this sort of “reconstruction”, since it’s an innovation-like task and they’ve been pretty bad at this so far. But:
This is obviously a concern if I’m wrong and LLMs are AGI-complete. In this case, they would eventually attain the skills for doing this kind of reconstruction, and only the aforementioned drastic hobbling would prevent it.
This might be less innovation-flavoured than it seems, because general-knowledge datasets might still contain the “shadows” of the harmful knowledge, or disparate pieces of it, such that its reconstruction is easy. (E. g., general physics knowledge + various scientific experiments and engineering principles + fiction and news articles which mention bombs.)
So: have you tried asking for instructions “indirectly”?
(Though even if this suffices to prevent reconstruction in the relatively dumb models of today, it might not work for tomorrow’s cleverer reasoners.
The core issue is the same as in e. g. Deep Deceptiveness: the circuits the model would use to reconstruct harmful knowledge are the same circuits that make it useful in other contexts, so you can’t have one without the other.)