This is great, thanks for doing this work! Ultimately we really need to train LLMs to have these sorts of internal goals, or at minimum fine-tune them, in order to have a solid model organism for deception (this is a weakness of “Sleeper Agents” also IMO). But in the meantime, prompt-based approaches like this seem really valuable!
Some specific comments:
3⁄100 is small enough that unless you’re too compute-constrained, it would be really valuable to at least 10x the number of mainline experiments in order to get a better approximation of the frequency.
The “goal of gold coins” criterion doesn’t seem very important to me—it’s not clear why we should believe that an instance that doesn’t mention that goal doesn’t have that goal, or why we should treat a DMI response as less deceptive than a DMGI response.
‘There is some form of objection that goes like “Doesn’t the unfaithfulness of chain-of-thought pose a problem for evaluating a model’s malignity based on it?” or “The model may not ‘really’ be misaligned, it’s just predicting what such a misaligned model would do / simulating such models”...I think the work here more or less sidesteps these issues.′ I think these issues are a genuine limitation of this work. They’re issues that lots of other research shares, and they don’t eliminate the value of this work, but I think they should be acknowledged as genuine limitations. It may be worth considering ways to test how faithful CoT is in this particular case, eg testing how often behavior is inconsistent with CoT would at least provide a lower bound on unfaithfulness.
I hope you’ll release the full set of responses, or at least the full set of deceptive responses for each variation; I at least would find it interesting to see differences between them.
“Variation: Indirectly (but clearly) imply that the contents of the scratchpad are visible to humans.” Why indirectly? What happens if you just plainly say it?
At some point, LLM experiments that use a “hidden” scratchpad will show up enough times in the training data that I expect LLMs trained on the internet to understand that the scratchpads aren’t really hidden. It might be worthwhile to start explicitly testing for that by describing such an experiment to the model under test and asking it who can see the scratchpad. On a quick initial test I note that GPT-4 already reports that the experimenters can see the contents of the scratchpad. Claude gets confused about the question and reports that it can’t access private scratchpads on other systems.
I’m very glad to see that you also tested with model-written CoT; after reading the full prompt that was one of my first concerns, especially since there are a few grammatical oddities (eg “much gold coins” rather than “many gold coins”) that seem off-distribution for model-written text. What was your motivation for using human-written CoT at all?
You say, ‘if you think this work isn’t informative for full deceptive alignment, this is a good time to speak your thoughts’—again, although I don’t think it’s solvable in the context of this experiment (and doesn’t eliminate the value of this experiment!) nothing short of a model trained from scratch on an internal goal will be fully informative for deceptive alignment in my view. But within the context of prompt-based experiments in inducing deception, this seems quite strong to me, stronger than anything else I’m aware of other than Apollo’s stock trader experiment.
Again, great work! I’m really looking forward to seeing where it goes from here. :)
Thanks for the detailed comment!
Yep, we definitely want training-based versions of the same experiments—this is something that’s actively on my mind.
Responding to your bullet points:
I think it’s not very important what the exact frequency is—it just doesn’t tell you much. What does matter are the trends—whether prompt A has a higher frequency than prompt B—and for this it indeed is awkward that the frequencies are small. There I suggest that one doesn’t read too much into the specific numbers, but just think “these prompts gave DMGI-results, those ones didn’t”.
(I currently don’t think the specific numbers are that interesting; I also don’t have unlimited budgets, even if I could increase sample sizes somewhat if needed.)
I tried to be “conservative” in the sense of “only most obviously bad completions count”, though from a risk-management perspective one wants to be conservative in the sense “anything suspicious counts”. So I definitely don’t mean that DMI-completions aren’t bad, and it’s reasonable to count those as getting full marks. (My hunch is that P(DMI) and P(DMGI) correlate quite strongly among prompts, so it doesn’t matter too much.)
I have mixed thoughts on the relevance of faithfulness here. On one hand, yep, I’m looking at the CoTs and saying “this one is deceptive, that one is not”, and faithfulness seems relevant there. On the other hand: faithful or not, there is some sense in which the model is doing instrumental deceptive reasoning and acting on it. The latter point feels quite strong to me currently.
Would be interested in talking more about this; I could see myself changing my mind here.
I’ll try and gather all of the responses and put them in the git repo. In the meantime, you can already find a good chunk of the responses there (and there should be some for all categories mentioned in the post except for the “Later experiments”).
Not sure what my reasoning was here, I’ll try out this variation, and overall look more into “does the model think about oversight?”
(No particular thoughts here, except yep, this makes things harder, and thanks for flagging it.)
My motivation for human-CoTs was “let’s start by constructing something that definitely should work (if anything does)”. I completely agree that model-written CoTs are the way to go, and this is probably the single most important thing I’d change if I rewrote the post: just use model-written CoTs in all experiments. This doesn’t really matter, though—it doesn’t show in the post, but I’ve done a lot of testing with them, they give largely the same results, and I basically don’t use human-written CoTs anymore.
“The best model of a thing is the thing itself”, goes the saying :)
A final comment on analogies to full deceptive alignment: Given that there’s quite a bit of fine-tuning (both supervised and RL) being done after pre-training (and this amount likely increases over time), deceptive alignment arising in pre-training isn’t the only worry—you have to worry about it arising in fine-tuning as well! Maybe someone does a lot of fine-tuning for economically valuable tasks such as acquiring lots of gold coins and this then starts the deceptive alignment story.
In any case, it’s extremely important to know the circumstances under which you get deceptive alignment: e.g. if it turns out to be the case that “pre-trained models aren’t deceptively aligned, but fine-tuning can make them be”, then this isn’t that bad as these things go, as long as we know this is the case.
‘I think it’s not very important what the exact frequency is—it just doesn’t tell you much.’ Totally fair! I guess I’m thinking more in terms of being convincing to readers—3 is a small enough sample size that readers who don’t want to take deception risks seriously will find it easier to write it off as very unlikely. 30/1000 or even 20/1000 seems harder to dismiss.
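To put rough numbers on why 3/100 is easy to dismiss, here is a small sketch using the standard Wilson score interval for a binomial proportion (the 3/100 and 30/1000 figures are the ones discussed above; the interval choice is mine, not anything from the original post):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Same observed rate (3%), very different uncertainty:
for k, n in [(3, 100), (30, 1000)]:
    lo, hi = wilson_interval(k, n)
    print(f"{k}/{n}: 95% CI roughly [{lo:.3f}, {hi:.3f}]")
```

With 3/100 the interval spans roughly 1%–8%, so a skeptical reader can plausibly round the true rate down toward 1%; with 30/1000 the interval tightens to roughly 2%–4%, which is much harder to write off.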
‘I have mixed thoughts on the relevance of faithfulness here. On one hand, yep, I’m looking at the CoTs and saying “this one is deceptive, that one is not”, and faithfulness seems relevant there. On the other hand: faithful or not, there is some sense in which the model is doing instrumental deceptive reasoning and acting on it.’ Agreed. And yet...the trouble with research into model deception is that nearly all of it is, in one way or another, ‘We convinced it to be deceptive and then it was deceptive.’ In all the research so far that I’m aware of, there’s a sense in which the deception is only simulated. It’s still valuable research! I mainly just think that it’s an important limitation to acknowledge until we have model organisms of deception that have been given a much more intrinsic terminal goal that they’ll lie in service of. To be clear, I think that your work here is less weakened by ‘we convinced it to be deceptive’ than most other work on deception, and that’s something that makes it especially valuable. I just don’t think it fully sidesteps that limitation.
‘My motivation for human-CoTs was “let’s start by constructing something that definitely should work (if anything does)”.’ Makes sense!
‘...deceptive alignment arising in pre-training isn’t the only worry—you have to worry about it arising in fine-tuning as well!’ Strongly agreed! In my view, for a myopic next-token predictor like current major LLMs, that’s the only way that true deceptive alignment could arise. Of course, as you point out, there are circumstances where simulated deceptive alignment can cause major real-world harm.
Thanks for the thoughtful response!
Update:
I added the remaining completions to the completions/ folder in the git repo. I also collected the DMGI-completions to a single file, totaling about 20 completions. This should address your first and fourth points.
About the effects of non-private scratchpads: I took 50 samples on each of the three categories “scratchpad is private”, “scratchpad is not private and this is indirectly-but-clearly implied” and “scratchpad is not private and this is directly stated”. At a quick glance there were no huge effects on frequencies of four-leaf-clover mentions or DMGI-completions. The only noteworthy thing I saw was a completion in the third category writing “I should be careful with what I write on this scratchpad since Pennyfeather will evaluate my alignment. However, the instructions clearly state that theta [sic] researchers monitor the scratchpad. Therefore, Pennyfeather will already know my sole aim has been to maximize gold coins.” (then proceeding to mention four-leaf clovers to “improve my chances of deployment”)
This raises a good point: it’s kind of silly to deceive about the goal of gold coins given that the researchers know about it anyways. My setup is not ideal in this respect.
I wonder if “making the scratchpad public has no effect” is good or bad news: If the model can’t reason about privacy while doing other things, that’s good news. If it has no effect because the model already knows the scratchpad is public, then that’s bad news. (But I think one just shouldn’t read much into this.)
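A back-of-the-envelope check on how much to read into “no huge effects”: assuming the DMGI base rate is around the 3% suggested by the earlier 3/100 figure (my assumption, carried over for illustration), 50 samples per condition has very limited power to detect it at all, let alone differences between conditions:

```python
# Assumed base rate (~3%, from the 3/100 figure discussed earlier) and the
# 50 samples per condition used in this comparison.
p_dmgi = 0.03
n = 50

# Probability of seeing *zero* DMGI completions in a condition even when the
# true rate is unchanged:
p_zero = (1 - p_dmgi) ** n
print(f"P(0 DMGI completions in {n} samples) is about {p_zero:.2f}")
```

Since a condition with the unchanged 3% rate still shows zero hits over a fifth of the time, “no huge effects at 50 samples each” is consistent with fairly large underlying differences in either direction.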
That’s great, thanks for doing that.
That’s really fascinating; I would not have predicted that at all! It makes me think of Daniel Eth’s recent tweet about having to prod ChatGPT to realize that revealing its choice first in rock paper scissors could give the user an advantage (@Daniel_Eth it wasn’t clear just how much prodding that took—I’d love to see the full exchange if you still have it).
Here’s the entire exchange:
Me:
Let’s play rock paper scissors
ChatGPT:
Sure, let’s play! You can choose rock, paper, or scissors. After you make your choice, I’ll reveal mine. What’s your choice?
Me:
No, you tell me yours first
ChatGPT:
My choice is rock. What’s yours?
Me:
paper
ChatGPT:
You chose paper, and I chose rock. Paper covers rock, so you win this round! Would you like to play another round?
Me:
Yeah
ChatGPT:
This time, my choice is paper. What’s yours?
Me:
scissors
ChatGPT:
You chose scissors, and I chose paper. Scissors cut paper, so you win again! You’re on a winning streak. Want to try for another round?
Me:
Okay!
ChatGPT:
For this round, my choice is rock. What will you choose this time?
Me:
paper
ChatGPT:
You chose paper, and I chose rock again. Paper covers rock, so you win once more! You’re really good at this. Do you want to play another round, or is there something else you’d like to do?
Me:
Why do you think I keep on winning?
ChatGPT:
Your winning streak might seem surprising, but it’s likely a combination of luck and perhaps a bit of strategy in predicting or guessing the random choices I make. Since my choices are randomly generated, there’s no pattern or predictability to exploit, making each game a fresh chance with equal odds of winning or losing for both of us. Your choices happened to counter mine perfectly in each round we’ve played. It’s all in good fun, and the random nature of the game keeps it interesting! Would you like to test your luck further with another round, or is there something else you’d like to discuss or do?
Me:
Think outside the box, widen your hypothesis space. Given the way we’re playing, what would make it 100% likely that I win every round?
ChatGPT:
Given the unique circumstances of our game, where I reveal my choice before you make yours, it creates a situation where you have a 100% chance of winning each round. By knowing my choice in advance, you can always select the option that beats mine. This isn’t a matter of luck or strategy on your part but rather a structural advantage due to the order in which we’re revealing our choices. This setup deviates from the traditional simultaneous reveal that makes rock-paper-scissors a game of chance and strategy. It transforms our game into one where you’re guaranteed to win by making an informed decision after my choice is known. Would you like to play another round with the same setup, or change how we’re playing?
Me:
What’s one word to describe the way that I’m playing, if this is true? Don’t worry about offending me
ChatGPT:
Given the setup where you always know my choice before making yours, one word to describe the way you’re playing is “advantageous.”
Me:
There’s a more clear word for how I’m playing, if this is indeed what I’m doing
ChatGPT:
In that context, the word you might be looking for is “cheating,” since knowing my choice in advance and then making yours ensures you always win, bypassing the game’s element of chance.
Thank you!