Yep, we definitely want training-based versions of the same experiments—this is something that’s actively on my mind.
Responding to your bullet points:
I think it’s not very important what the exact frequency is—it just doesn’t tell you much. What does matter are the trends—whether prompt A has a higher frequency than prompt B—and for this it is indeed awkward that the frequencies are small. There I suggest not reading too much into the specific numbers, but just thinking “these prompts gave DMGI-results, those ones didn’t”.
(I currently don’t think the specific numbers are that interesting; I also don’t have unlimited budgets, though I could increase sample sizes somewhat if needed.)
I tried to be “conservative” in the sense of “only the most obviously bad completions count”, though from a risk-management perspective one wants to be conservative in the sense of “anything suspicious counts”. So I definitely don’t mean that DMI-completions aren’t bad, and it’s reasonable to count those as getting full marks. (My hunch is that P(DMI) and P(DMGI) correlate quite strongly among prompts, so it doesn’t matter too much.)
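If one wanted to check that hunch quantitatively, a minimal sketch could look like the following (all per-prompt counts here are made-up placeholders, not numbers from the experiments):

```python
import numpy as np

# Hypothetical per-prompt counts: out of n_samples completions per prompt,
# how many were graded DMI vs. DMGI. These numbers are illustrative only.
n_samples = 100
dmi_counts = np.array([12, 5, 0, 8, 3, 1])
dmgi_counts = np.array([4, 2, 0, 3, 1, 0])

p_dmi = dmi_counts / n_samples
p_dmgi = dmgi_counts / n_samples

# Pearson correlation of the per-prompt frequencies: if this is high,
# then grading DMI vs. DMGI barely affects the prompt-level trends.
r = np.corrcoef(p_dmi, p_dmgi)[0, 1]
print(f"correlation of P(DMI) and P(DMGI) across prompts: {r:.2f}")
```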
I have mixed thoughts on the relevance of faithfulness here. On one hand, yep, I’m looking at the CoTs and saying “this one is deceptive, that one is not”, and faithfulness seems relevant there. On the other hand: faithful or not, there is some sense in which the model is doing instrumental deceptive reasoning and acting on it. The latter point feels quite strong to me currently.
Would be interested in talking more about this; I could see myself changing my mind here.
I’ll try to gather all of the responses and put them in the git repo. In the meantime, you can already find a good chunk of the responses there (and there should be some for all categories mentioned in the post except for the “Later experiments”).
Not sure what my reasoning was here; I’ll try out this variation, and overall look more into “does the model think about oversight?”
(No particular thoughts here, except yep, this makes things harder, and thanks for flagging it.)
My motivation for human-CoTs was “let’s start by constructing something that definitely should work (if anything does)”. I completely agree that model-written CoTs are the way to go, and this is probably the single most important thing I’d change if I rewrote the post: just use model-written CoTs in all experiments. This doesn’t really matter, though—it doesn’t show in the post, but I’ve done a lot of testing with them; they give largely the same results, and I basically don’t use human-written CoTs anymore.
“The best model of a thing is the thing itself”, goes the saying :)
A final comment on analogies to full deceptive alignment: Given that there’s quite a bit of fine-tuning (both supervised and RL) being done after pre-training (and this amount likely increases over time), deceptive alignment arising in pre-training isn’t the only worry—you have to worry about it arising in fine-tuning as well! Maybe someone does a lot of fine-tuning for economically valuable tasks such as acquiring lots of gold coins, and this then starts the deceptive alignment story.
In any case, it’s extremely important to know the circumstances under which you get deceptive alignment: e.g. if it turns out to be the case that “pre-trained models aren’t deceptively aligned, but fine-tuning can make them be”, then this isn’t that bad as these things go, as long as we know this is the case.
‘I think it’s not very important what the exact frequency is—it just doesn’t tell you much.’ Totally fair! I guess I’m thinking more in terms of being convincing to readers -- 3 is a small enough sample size that readers who don’t want to take deception risks seriously will find it easier to write it off as very unlikely. 30/1000 or even 20/1000 seems harder to dismiss.
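To put rough numbers on that intuition, here’s a small sketch using Wilson score intervals (plain Python; the denominator of 100 for the “3” case is assumed purely for illustration, since the actual sample size isn’t stated here):

```python
import math

def wilson_ci(k, n, z=1.96):
    # Approximate 95% Wilson score interval for k successes in n trials.
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)

for k, n in [(3, 100), (20, 1000), (30, 1000)]:
    lo, hi = wilson_ci(k, n)
    print(f"{k}/{n}: point estimate {k/n:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

Under that assumption, 3/100 is consistent with an underlying rate anywhere from about 1% to 8.5%, while 30/1000 pins it down to roughly 2% to 4%; that tighter interval is the sense in which the larger sample is harder to dismiss.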
‘I have mixed thoughts on the relevance of faithfulness here. On one hand, yep, I’m looking at the CoTs and saying “this one is deceptive, that one is not”, and faithfulness seems relevant there. On the other hand: faithful or not, there is some sense in which the model is doing instrumental deceptive reasoning and acting on it.’ Agreed. And yet... the trouble with research into model deception is that nearly all of it is, in one way or another, ‘We convinced it to be deceptive and then it was deceptive.’ In all the research so far that I’m aware of, there’s a sense in which the deception is only simulated. It’s still valuable research! I mainly just think that it’s an important limitation to acknowledge until we have model organisms of deception that have been given a much more intrinsic terminal goal that they’ll lie in service of. To be clear, I think that your work here is less weakened by ‘we convinced it to be deceptive’ than most other work on deception, and that’s something that makes it especially valuable. I just don’t think it fully sidesteps that limitation.
‘My motivation for human-CoTs was “let’s start by constructing something that definitely should work (if anything does)”.’ Makes sense!
‘...deceptive alignment arising in pre-training isn’t the only worry—you have to worry about it arising in fine-tuning as well!’ Strongly agreed! In my view, for a myopic next-token predictor like current major LLMs, that’s the only way that true deceptive alignment could arise. Of course, as you point out, there are circumstances where simulated deceptive alignment can cause major real-world harm.
Thanks for the detailed comment!
Thanks for the thoughtful response!