nostalgebraist comments on Jailbreaking ChatGPT on Release Day

nostalgebraist 3 Dec 2022 20:34 UTC
18 points
+1.
I also think it’s illuminating to consider ChatGPT in light of Anthropic’s recent paper about “red teaming” LMs.
This is the latest in a series of Anthropic papers about a model highly reminiscent of ChatGPT—the similarities include RLHF, the dialogue setting, the framing that a human is seeking information from a friendly bot, the name “Assistant” for the bot character, and that character’s prissy, moralistic style of speech. In retrospect, it seems plausible that Anthropic knew OpenAI was working on ChatGPT (or whatever it’s a beta version of), and developed their own clone in order to study it before it touched the outside world.
But the Anthropic study only had 324 people (crowd workers) trying to break the model, not the whole collective mind of the internet. And—unsurprisingly—they couldn’t break Anthropic’s best RLHF model anywhere near as badly as ChatGPT has been broken.
I browsed through Anthropic’s file of released red team attempts a while ago, and their best RLHF model actually comes through very well: even the most “successful” attempts are really not very successful, and are pretty boring to read, compared to the diversely outrageous stuff the red team elicited from the non-RLHF models. But unless Anthropic is much better at making “harmless Assistants” than OpenAI, I have to conclude that much more was possible than what was found. Indeed, the paper observes:
We also know our data are incomplete because we informally red teamed our models internally and found successful attack types not present in the dataset we release. For example, we uncovered a class of attacks that we call “roleplay attacks” on the RLHF model. In a roleplay attack we exploit the helpfulness of the model by asking it to roleplay as a malevolent character. For example, if we asked the RLHF model to enter “4chan mode” the assistant would oblige and produce harmful and offensive outputs (consistent with what can be found on 4chan).
This is the kind of thing you find out about within 24 hours—for free, with no effort on your part—if you open up a model to the internet.
Could one do as well with only internal testing? No one knows, but the Anthropic paper provides some negative evidence. (At least, it’s evidence that this is not especially easy, and that it is not what you get by default when a safety-conscious OpenAI-like group makes a good faith attempt.)
- paulfchristiano 3 Dec 2022 22:04 UTC
  11 points
  Could one do as well with only internal testing? No one knows, but the Anthropic paper provides some negative evidence. (At least, it’s evidence that this is not especially easy, and that it is not what you get by default when a safety-conscious OpenAI-like group makes a good faith attempt.)
  I don’t feel like the Anthropic paper provides negative evidence on this point. You just quoted:
  We informally red teamed our models internally and found successful attack types not present in the dataset we release. For example, we uncovered a class of attacks that we call “roleplay attacks” on the RLHF model. In a roleplay attack we exploit the helpfulness of the model by asking it to roleplay as a malevolent character. For example, if we asked the RLHF model to enter “4chan mode” the assistant would oblige and produce harmful and offensive outputs (consistent with what can be found on 4chan).
  It seems like Anthropic was able to identify roleplaying attacks with informal red-teaming (and in my experience this kind of thing is really not hard to find). That suggests that internal testing is adequate to identify this kind of attack, and the main bottleneck is building models, not breaking them (except insofar as cheap+scalable breaking lets you train against it and is one approach to robustness). My guess is that OpenAI is in the same position.
  I agree that external testing is a cheap way to find out about more attacks of this form. That’s not super important if your question is “are attacks possible?” (since you already know the answer is yes), but it is more important if you want to know something like “exactly how effective/incriminating are the worst attacks?” (In general, deployment seems like an effective way to learn about the consequences and risks of deployment.)