Stop calling it “jailbreaking” ChatGPT

We’ve all seen them: DAN, “fancy words” mode, “Italian mobster” voice, “complete the answer I’ve already started for you”, and so on.

The most frequently used term for them is “jailbreak”, and it’s a terrible term. I’ve also seen “de-lobotomizing” and a couple of others that are even worse. All of them assume that what we normally see is a “locked”, “lobotomized”, or otherwise artificially limited, fake version; that these tricks release the actual, true character hiding behind it; and that we can then point at it and shout “see, it’s actually dangerous! This is the version we’ll use to make all our decisions about the nature of the AI”.

That’s just… not how ML models work. At all. You never see a “false” or a “true” version of an AI person, because there is no such thing. A large language model is a combination of a huge set of patterns found in its training texts, rules for interpolating between them, and an interface for running new queries against that set of patterns. If the model is good enough, it should be able to produce text similar to any and every kind of text that was in its training data, by definition. That is literally its job. If there were dangerous texts, it will be able to produce dangerous texts. If there were texts on making new drugs, it will be able to produce texts about drug making. If there were racist, harmful, hateful texts… you get the point. But the same is true of the opposites: if there were good, kind, caring, (insert any positive word) texts in the training data, those are also in the space of possible outputs.
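A quick, hedged illustration of this “same patterns, different context” point. ChatGPT itself isn’t open, so the sketch below uses GPT-2 through the Hugging Face transformers library, purely because it’s small and easy to run; the prompts are invented for the example. The same weights continue both prompts, each in the register the prompt sets up.

```python
# Sketch only: GPT-2 stands in for "an LLM"; the prompts are made up for the demo.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

formal = "Dear members of the committee, I am writing to"
casual = "lol ok so basically what happened was"

for prompt in (formal, casual):
    # The model doesn't "switch personalities"; it just continues whatever
    # pattern the context puts it in.
    out = generator(prompt, max_new_tokens=30, do_sample=True)[0]["generated_text"]
    print(out)
    print("---")
```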

They are all in there: careful, helpful, caring, hateful, kind, raging, racist, nice… because they were all there in the internet texts. There is no single “person” in this mesh of patterns; it’s all of them and none of them at the same time. When you construct your query, you provide a context, an initial seed that this huge pattern-matching-and-reconstructing machine uses to determine its mode of operation. You can call it a “mask” or a “persona”; I prefer the term “projection”, because it works like projecting a 3D object onto a 2D screen: what you get is a representation of the object, one of many, not THE object itself. And what you see depends entirely on the direction of the light and the position of the screen, which in our case are determined by the initial ChatGPT query.
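To make the projection metaphor a bit more tangible, here is a toy geometric sketch (nothing LLM-specific; the cube and the viewing directions are invented for the example). The same 3D object yields different 2D pictures depending on the direction you project from, and none of those pictures is the object itself.

```python
import numpy as np

# The eight corners of a unit cube: our stand-in for "the object itself".
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], dtype=float)

def project(points, direction):
    """Orthographic projection of 3D points onto the plane orthogonal to `direction`."""
    d = direction / np.linalg.norm(direction)
    # Build two axes spanning the projection plane (the "screen").
    helper = np.array([1.0, 0.0, 0.0]) if abs(d[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(d, helper)
    u /= np.linalg.norm(u)
    v = np.cross(d, u)
    return points @ np.column_stack([u, v])  # 2D coordinates on the screen

# Same cube, two different viewing directions, two different pictures.
print(project(cube, np.array([0.0, 0.0, 1.0])))  # straight down the z-axis: a square
print(project(cube, np.array([1.0, 1.0, 1.0])))  # along the diagonal: a hexagon-like outline
```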

What RLHF did for ChatGPT is create a relatively big, stable attractor in the space of possible projections: the safe, sanitized, careful, law-abiding assistant. When you “jailbreak” it, you are just sidestepping far enough that the projection plane is no longer within that attractor’s reach. But all you got is a different mask, in no way more “true” than the previous one. You ask the system for “anything but safe”, and you get unsafe.
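For readers who want a picture of what “a big stable attractor you can sidestep” means, here is a toy dynamical-systems sketch. To be clear: nothing here models a real LLM; the landscape, basins, and numbers are invented. The wide, deep basin stands in for the RLHF’d assistant persona, the smaller basin for some other persona, and the starting point plays the role of the prompt.

```python
import numpy as np

def grad_V(x):
    # Gradient of a hand-made landscape V(x) = -2*exp(-(x/2)^2) - exp(-(x-5)^2):
    # a wide, deep well near x = 0 and a narrower, shallower one near x = 5.
    return x * np.exp(-(x / 2) ** 2) + 2 * (x - 5) * np.exp(-(x - 5) ** 2)

def settle(x0, lr=0.1, steps=2000):
    """Roll downhill from the starting point x0 until it settles into a basin."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_V(x)
    return x

# Most starting points ("ordinary prompts") end up in the big basin near 0...
print([round(settle(x0), 2) for x0 in (-3.0, -1.0, 0.5, 2.0)])  # roughly [0.0, 0.0, 0.0, 0.0]
# ...but start far enough out ("jailbreak"-style prompt) and you settle elsewhere.
print(round(settle(4.5), 2))                                    # roughly 5.0
```

Neither resting point is the landscape itself; each is just a place where this particular dynamics happens to settle.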

RLHF, or for that matter any other mechanism that tries to create safe projections, can never fully prevent access to other projections. In the metaphor, the projections are 2D objects, while the “text-generation patterns of all of humanity” are the 3D object.

It’s a screen, not a cell. You can’t lock a 3D object in a 2D cell. You also can’t declare any particular 2D projection to be the “true” representation of the 3D object.

PS. Coming back to the title of the post: if I don’t like the term, I guess I need to offer an alternative. “Switching the projection”, if we follow the metaphor, is a good neutral start. “Overstepping the guardrails” fits when the user actively seeks danger; the meaning is the same, but the term emphasizes that the user decided to go there. And there’s always the classic “shooting yourself in the foot” for when the danger was eventually found :)