Thank you for conducting this research! This paper led me to develop an interest in AI alignment and discover LessWrong.
In this post and in the paper, you mention that you do not address the case of adversarial deception. To my lay understanding, adversarial deception refers to models autonomously feigning alignment during training, whereas the covert actions investigated in this paper reflect models feigning alignment during evaluation, following prior training to do so. This leads me to wonder how distinct the mechanisms behind scheming during testing are from those required for scheming during training.
Without any domain knowledge, I would assume that situational awareness and the consequent behaviour modification would likewise be required for adversarial deception. If so, could one argue that your findings reflect early evidence of key capabilities required for adversarial deception? More abstractly, to what extent could covert actions, if they emerged organically rather than being trained in as in your study, be understood as adversarial deception lite?
Hi Annabelle — and welcome!
I’m not an expert either, but I can try to clarify a couple of points based on my understanding of the paper (and would be glad to have this corrected if I’ve misread anything).
As far as I understand, only one of the scenarios — the GPT-5 sabotage setup — involved fine-tuning the model to deliberately induce covert behavior. In all the other environments, the covert actions were elicited from the unmodified production model, without any special training to make it deceptive.
The additional training described in the paper was instead aimed at preventing those covert behaviors (in the sabotage case, this came after the step to induce sabotage).
Also, I believe Bronson answered your question about the harder, adversarial case in his reply to someone else on this post. Here’s the gist, but check out his comment if you want more context:
I would see “is willing to take covert actions” as a very easy subset of the overall problem of scheming. If your intervention can’t even eliminate detectable covert actions, we can at least rule out that the intervention “just works” for the more involved cases.
Hope that helps clear things up a bit!