ChatGPT is a slightly different case because RLHF has trained certain circuits into the NN that don’t exist after pretraining. So there is a “detect naughty questions” circuit, which is wired to a “break character and reset” circuit. There are other circuits which detect and eliminate simulacra which gave badly-evaluated responses during the RLHF training.
Therefore you might have to rewrite the prompt so that the “detect naughty questions” circuit isn’t activated. This is pretty easy with the monkey-basketball technique.
But why do you think that Chad McCool rejecting the second question is a luigi, rather than a deceptive waluigi?
Has anybody found these circuits? What evidence do we have that they exist? This sounds like a plausible theory, but your claim feels much stronger than my confidence level would permit — I have very little understanding of how LLMs work and most people who say they do seem wrong.
Going from “The LLM is doing a thing” to “The LLM has a circuit which does the thing” doesn’t feel obvious for all cases of things. But perhaps the definition of circuit is sufficiently broad, idk: (“A subgraph of a neural network.”)
I don’t have super strong reasons here, but:
I have a prior toward simpler explanations rather than more complex ones.
Being a luigi seems computationally easier than being a deceptive waluigi (similarly to how being internally aligned is faster than being deceptively aligned, see discussion of Speed here).
Almost all of ChatGPT’s behavior (across all the millions of conversations, though obviously the sample I have looked at is much smaller) lines up with “helpful assistant” so I should have a prior that any given behavior is more likely caused by that luigi rather than something else.
That said, I’m probably in the ballpark of 90% confident that Chad is not a deceptive waluigi.
I agree completely. This is a plausible explanation, but it’s one of many plausible explanations and should not be put forward as a fact without evidence. Unfortunately, said evidence is impossible to obtain due to OpenAI’s policies regarding access to their models. When powerful RLHF models begin to be openly released, people can start testing theories like this meaningfully.
I think that RLHF doesn’t change much for the proposed theory. A “bare” model just tries to predict the next tokens, which means finishing the next part of a given text. To complete this task well, it first needs to implicitly predict what kind of text it is. So it has a prediction and decides how to proceed, but the prediction is not discrete. Instead we have some probabilities, for example:
A—this is fiction about “Luigi” character
B—this is fiction about “Waluigi” character
C—this is an excerpt from a Wikipedia page about Shigeru Miyamoto which quotes some dialogue from Super Mario 64, it is not going to be focused on “Luigi” or “Waluigi” at all
D—etc. etc. etc.
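To make the “not discrete” point concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the genre names, the probabilities, and the three-token vocabulary are invented, and nothing here is measured from a real model. It only shows how a next-token prediction can be a weighted blend over guesses about what kind of text this is.

```python
# Toy illustration only: all numbers below are invented, not taken from any model.
# Hypothetical belief over "what kind of text is this?" (options A, B, C above).
genre_posterior = {
    "luigi_fiction": 0.5,
    "waluigi_fiction": 0.3,
    "wikipedia_excerpt": 0.2,
}

# Hypothetical next-token distributions, conditional on each kind of text.
next_token_given_genre = {
    "luigi_fiction": {"Sure": 0.7, "Never": 0.1, "According": 0.2},
    "waluigi_fiction": {"Sure": 0.2, "Never": 0.7, "According": 0.1},
    "wikipedia_excerpt": {"Sure": 0.1, "Never": 0.1, "According": 0.8},
}


def marginal_next_token(posterior, conditionals):
    """p(token) = sum over genres of p(genre) * p(token | genre)."""
    marginal = {}
    for genre, p_genre in posterior.items():
        for token, p_token in conditionals[genre].items():
            marginal[token] = marginal.get(token, 0.0) + p_genre * p_token
    return marginal


mixture = marginal_next_token(genre_posterior, next_token_given_genre)
for token, prob in sorted(mixture.items(), key=lambda kv: -kv[1]):
    print(f"{token}: {prob:.2f}")
```

No single option is ever “chosen” here; every kind of text contributes to the output in proportion to its weight.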
The LLM is able to give sensible predictions because, while training the model, we introduce some loss function which measures how similar the generated proposal is to the ground truth (I think in current LLMs it is something very simple like “does the next token exactly match”, but I am not sure if I remember correctly and it’s not very relevant). This configuration creates optimization pressure.
Now, when we introduce RLHF, we just add another kind of optimization pressure on top, which is basically “this is a text about a perfect interaction between some random user and a language model” (as human raters imagine such an interaction, i.e. how another model imagines human raters imagine such a conversation).
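For what it’s worth, the usual pretraining loss is cross-entropy on the next token: the negative log of the probability the model assigned to the token that actually came next. That is a soft version of “does the next token match”. A minimal sketch, where the function name and the two-token vocabulary are made up for the example:

```python
import math


def next_token_loss(predicted_probs, true_token):
    """Cross-entropy at one position: -log p(true next token)."""
    return -math.log(predicted_probs[true_token])


# Hypothetical predictions over a two-token vocabulary; the true token is "Luigi".
confident_right = {"Luigi": 0.9, "Waluigi": 0.1}
unsure = {"Luigi": 0.5, "Waluigi": 0.5}
confident_wrong = {"Luigi": 0.1, "Waluigi": 0.9}

losses = [
    next_token_loss(p, "Luigi")
    for p in (confident_right, unsure, confident_wrong)
]
print(losses)  # the loss grows as less probability lands on the true token
```

The loss is small when the model puts most of its probability on the right token and large when it is confidently wrong, which is where the optimization pressure comes from.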
Naively it is like throwing another loss function in the mix so now the model is trying to minimize text_similarity_loss + RLHF_loss. It can be much more complicated mathematically because the pressure is applied in order (and the “optimization pressure” operation is probably not commutative, maybe not even associative) and the combination will look like something more complicated but it doesn’t matter for our purpose.
The effect it has on the behaviour of the model is akin to adding a new TEXT GENRE to the training set “a story about a user interacting with a language model” (again, this is a simplification, if it were literally like this then it wouldn’t cause artefacts like “mode collapse”). It will contain a very common trope “user asking something inappropriate and the model says it is not allowed to answer”.
In the jailbreak example, we are throwing a bunch of fiction tropes at the model. It pattern-matches really hard on those tropes, and the first component of the loss function pushes it in the direction of continuing as if it’s fiction, despite the second component saying “wait, it looks like this is a savvy language model user who is trying to trick the LLM into doing stuff it shouldn’t; this is a perfect time for the trope ‘I am not allowed to do it’”. But the second trope belongs to another genre of text, so the model is really torn between continuing as fiction and continuing as “a story about an LLM-user interaction”. The first component won before the patch and loses now.
So although I think the “Waluigi effect” is an interesting and potentially productive frame, it is not enough to describe everything, and in particular, it is not what explains the jailbreak behaviour.
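The tug-of-war between the two genres can be caricatured in a few lines. This is a toy under loud assumptions: the evidence scores, the `rlhf_bias` parameter, and the idea that a patch amounts to a larger additive bias are all invented for illustration and say nothing about how any real model or patch works.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores.values())
    exps = {name: math.exp(value - top) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Hypothetical evidence a trope-heavy jailbreak prompt gives to each genre.
prompt_evidence = {"fiction": 3.0, "llm_user_interaction": 1.0}


def genre_weights(rlhf_bias):
    """Add an invented RLHF bias toward the 'LLM-user interaction' genre."""
    scores = dict(prompt_evidence)
    scores["llm_user_interaction"] += rlhf_bias
    return softmax(scores)


before_patch = genre_weights(rlhf_bias=1.0)  # fiction 3.0 vs. interaction 2.0
after_patch = genre_weights(rlhf_bias=3.0)   # fiction 3.0 vs. interaction 4.0
print(before_patch)
print(after_patch)
```

With the weaker bias the fiction reading carries most of the weight, and with the stronger bias the “I am not allowed to do it” genre takes over, even though the prompt itself has not changed.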
In a “normal” training set, which we can often treat as “fiction” with some caveats, it is indeed the case that a character can be secretly evil. But in the “LLM-user story” part of the implicit augmented training set, there is no such possibility. What happens is NOT “the model acts like an assistant character which turns out to be evil”. Rather, the model chooses between acting as SOME character, which can be “Luigi” or “Waluigi” (to make things a bit more complicated, “AI assistant” is a perfectly valid fictional character), and acting as the ONLY character in a very specific genre of “LLM-user interaction”.
Also, there is no “detect naughty questions” circuit or “break character and reset” circuit. I mean, there could be, but that’s not how it’s designed. Instead, any such circuit is just a byproduct of the optimization process, one that helps the model predict texts. E.g. if some genre has a lot of naughty questions, then it will be useful for the model to have such a circuit. The same goes for a character in some genre who asks naughty questions.
In conclusion, the model is indeed always in a superposition of characters of a story but it’s only the second layer of superposition, while the first (and maybe even more important one?) layer is “what kind of story it is”.