Thank you for conducting this research! This paper led me to develop an interest in AI alignment and discover LessWrong.
In this post and in the paper, you mention you do not address the case of adversarial deception. To my lay understanding, adversarial deception refers to models autonomously feigning alignment during training, whereas the covert actions investigated in this paper reflect models feigning alignment during evaluation following prior training to do so. This leads me to wonder how distinct the mechanisms models use to scheme during testing are from those required for scheming during training.
Without any domain knowledge, I would assume situational awareness and consequent behaviour modification would likewise be required for adversarial deception. If this is true, could one make the argument that your findings reflect early evidence of key capabilities required for adversarial deception? More abstractly, to what extent could covert actions—if they emerged organically rather than being trained in as in your study—be understood as adversarial deception lite?