We talk about jailbreaks in the post, and to reiterate from it: Claude 2 is supposed to be quite resistant to them.
Jailbreaks like llm-attacks don’t work reliably, and they can semantically change the meaning of your prompt.
So have you actually tried to do (3) for some topics? I suspect it would at least take a huge amount of time, or be close to impossible. How do you cleverly pretext the model into writing a mass shooting threat, explaining how to build anthrax, or producing hate speech? It is not obvious to me. It seems to me this all depends on the model being kind of dumb; future models can probably see right through your clever pretexting and might call you out on it.
I’m right there with you on jailbreaks: it seems not too terribly hard to prevent, and it sounds like Claude 2 is already quite resistant to this style of attack.
I have personally seen an example of doing (3) that seems tough to remediate without significantly hindering overall model capabilities. I won’t go into detail here, but I’ll privately message you about it if you’re open to that.
I agree it seems quite difficult to use (3) for things like generating hate speech. A bad actor would likely need (1) for that. Then again, a language model isn’t necessary for generating hate speech in the first place.
The anthrax example is a good one. It seems like (1) would be needed to access that kind of capability (it seems difficult to get there without explicitly saying “anthrax”), and that it would be difficult to obtain that kind of knowledge without access to a capable model.

Often when I bring up the idea that keeping model weights on secure servers and preventing exfiltration are critical problems, I’m met with the response “but won’t bad actors just trick the plain vanilla model into doing what they want it to do? Why go to the effort to get the weights and maliciously fine-tune if there’s a much easier way?” My opinion is that more proofs of concept like the anthrax one, and like Anthropic’s experiments on getting an earlier version of Claude to help with building a bioweapon, are important for getting people to take this particular security problem seriously.