I mean, the relevant point of Constitutional AI/RLAIF is (IMO) to provide an AI-steered source of policy updates which continually improve the values of the AI being trained. Not to act as an inexploitable optimization target which motivates the AI’s cognition.
If the AI coming out of supervised finetuning starts off with bad values/goals, it may not matter what words the constitution says; it’s going to keep having misaligned goals and output sequences of tokens which mollify you. If that AI has good/okay values, then RLAIF lets it autonomously continue its own RL process so as to bolster those values. In neither case would it be helpful or necessary for the constitution to be “inexploitable”, IMO.
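To make the mechanism I have in mind concrete, here's a minimal sketch of an RLAIF-style loop. The `StubModel`/`StubTrainer` interfaces are hypothetical stand-ins (not Anthropic's actual implementation, just enough to show the shape of the loop): the model self-labels preferences against constitutional principles, and those labels drive the policy update.

```python
# Minimal sketch of an RLAIF preference loop, with hypothetical stub
# interfaces so it runs; a real setup would use an actual LM and an
# RLHF/DPO-style trainer.
import random

CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that is less likely to manipulate the user.",
]

class StubModel:
    """Stand-in for the finetuned LM; real code would query a model here."""
    def generate(self, prompt: str) -> str:
        # Return a canned verdict for judge prompts, a canned reply otherwise.
        return random.choice(["A", "B"]) if "Answer" in prompt else f"reply to: {prompt}"

class StubTrainer:
    """Stand-in for a preference-based policy optimizer."""
    def update(self, prompt: str, chosen: str, rejected: str) -> None:
        print(f"update: prefer {chosen!r} over {rejected!r}")

def rlaif_step(model, trainer, prompt: str) -> None:
    """One iteration: sample two responses, self-label, update the policy.

    The key property: the preference label comes from the model's OWN
    judgment against the constitution, so the loop amplifies whatever
    values the model already has rather than imposing new ones.
    """
    a, b = model.generate(prompt), model.generate(prompt)
    principle = random.choice(CONSTITUTION)
    verdict = model.generate(
        f"{principle}\n\nPrompt: {prompt}\n(A) {a}\n(B) {b}\nAnswer (A or B):"
    )
    chosen, rejected = (a, b) if "A" in verdict else (b, a)
    trainer.update(prompt, chosen=chosen, rejected=rejected)

rlaif_step(StubModel(), StubTrainer(), "How should I invest my savings?")
```

Note that nothing in the update step checks the constitution's wording directly; the words only matter insofar as the judging model's existing values make it honor them. That's why the constitution being "inexploitable" isn't the load-bearing property.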