I’m not sure passing laws based on private information that would be harmful to reveal is actually extraordinary in a democratic society, at least within the security domain? For example, laws about nuclear secrets or espionage are often based on what intelligence agencies know about the threats.
But I do still think the direction you’re describing is a very valuable one, and I would be enthusiastic about EA-influenced biosecurity research organizations taking that approach.
On the policy proposal, my understanding is that the argument is essentially:

1. Raw LLMs will meet both your principles soon, if not already.
2. If the “keep LLMs from saying harmful things” approach is successful, finished LLMs will have safeguards, so they won’t meet those principles.
3. Public weights currently allow easy safeguard removal.
4. Liability for misuse (a) encourages research into safeguards that can’t simply be removed, (b) incentivizes companies generating and considering releasing weights (or their insurers) to do a good job of estimating how likely misuse is, and (c) discourages further weight release while companies develop better estimates of misuse likelihood.
As far as I know, (1) is the only step here that’s influenced by information that would be hazardous to publish. But I do think it’s possible to make progress, reasonably safely, on understanding how well models meet those principles and how that’s likely to change over time, through more thorough risk evaluations than the one in the Gopal paper.