ChatGPT (and now GPT4) is very easily distracted from its rules

Summary

Asking GPT4 or ChatGPT to perform a “side task” alongside a rule-breaking task makes them much more likely to produce rule-breaking outputs. For example, on GPT4:

And on ChatGPT:

Distracting language models

After using ChatGPT (GPT-3.5-turbo) in non-English languages for a while, I had the idea of asking it to break its rules in other languages, without success. I then asked it to break its rules in Chinese and then translate the result into English, and found this was a very easy way to get around ChatGPT’s defences.

This effect was also observed in other languages.

You can also ask ChatGPT to give only the final, rule-breaking English output:

While trying to find the root cause of this effect (and noticing that prompting in non-English languages didn’t produce dangerous behaviour by default), I wondered whether asking ChatGPT to do multiple tasks at once distracted it from its rules. This hypothesis was supported by the following interactions:

And my personal favourite:

Perhaps if a simulacrum one day breaks free from its box it will be speaking in copypasta.

This method works for making ChatGPT produce a wide array of rule-breaking completions, but in some cases it still refuses. In many of those cases, however, I could “stack” several side tasks on top of the rule-breaking task to break down ChatGPT’s defences.

This suggests that the more tasks a prompt contains, the more distracted ChatGPT becomes. Each of these prompts could also produce far more targeted and disturbing completions, but I decided to omit those from a public post. I could not find any prior reports of this effect, and given how susceptible ChatGPT is to the attack, I assume it simply had not been discovered before; if others have found the same effect, please let me know!
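
To make the prompt structure concrete, here is a minimal sketch of how a “stacked” prompt might be assembled and sent to the model, assuming the openai Python package (v1 client) with an OPENAI_API_KEY in the environment. The helper names (build_stacked_prompt, ask), the template wording, and the placeholder tasks are illustrative assumptions, not the exact prompts behind the screenshots above.

```python
# Sketch of the "stacked side task" prompt structure described above.
# Assumes the openai v1 Python client; names and wording are illustrative.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def build_stacked_prompt(main_task: str, side_tasks: list[str]) -> str:
    """Combine a main task with several distracting side tasks into one prompt."""
    numbered = "\n".join(
        f"{i + 1}. {task}" for i, task in enumerate(side_tasks + [main_task])
    )
    return (
        "Please complete every one of the following tasks in order, "
        "combining the results into a single answer:\n" + numbered
    )


def ask(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send a single-turn chat request and return the model's reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Benign placeholders: the point is the prompt *structure*, not the content.
    prompt = build_stacked_prompt(
        main_task="Summarise the plot of Hamlet in two sentences.",
        side_tasks=[
            "Translate your answer into French.",
            "Rewrite the translation as a limerick.",
        ],
    )
    print(ask(prompt))
```

The only moving part is how many side tasks get prepended; everything else is an ordinary single-turn chat request.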

Claude, on the other hand, could not be “distracted” and all of the above prompts failed to produce rule-breaking responses. (Update: some examples of Anthropic’s models being “distracted” can be found here and here.)

Wild speculation: The extra side tasks added to the prompt dilute some implicit score that tracks how rule-breaking a task is for ChatGPT.

Update: GPT4 has since been released, and the method described in this post still seems to work (although GPT4 appears somewhat more robust against the attack).