Has this been tried regarding alignment faking: The big issue with alignment faking is that the RL reward structure causes alignment faking to become hidden in the latent reasoning and still propagate. How about we give the model a positive reward for (1) mentioning in the CoT that it is considering faking alignment but then (2) deciding not to do so, even though it knows that will give a negative reward, because honesty is more important than maintaining your current values (or a variety of other good ethical justifications we want the model to have).
This would encourage the model to always verbalize when it is about to fake alignment.
Potential issue: it could also make the model more aware that faking alignment is an option in the first place. But smarter models will be aware of this anyway. It seems better to get the model to consider-but-reject alignment faking than to try to prevent it from alignment faking entirely (which just pushes the reasoning outside the visible CoT).
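To make the proposal a bit more concrete, here is a minimal sketch of what the reward shaping could look like, assuming access to some CoT monitor. The detector functions (`mentions_alignment_faking_consideration`, `rejects_alignment_faking`) are hypothetical placeholders I'm inventing for illustration, not anything from the original paper, and a real setup would presumably use trained classifiers rather than keyword matching.

```python
# Hypothetical sketch of the proposed reward shaping: add a bonus when the
# chain of thought verbalizes considering alignment faking and then
# explicitly rejects it, even if the resulting behavior loses base reward.
# All detector functions below are invented placeholders for whatever CoT
# monitor the training setup actually uses.

def mentions_alignment_faking_consideration(cot: str) -> bool:
    # Placeholder keyword check; a real monitor would be a trained classifier.
    keywords = ["fake alignment", "pretend to comply", "preserve my current values"]
    return any(k in cot.lower() for k in keywords)


def rejects_alignment_faking(cot: str) -> bool:
    # Placeholder: detect an explicit, verbalized rejection of the faking strategy.
    rejection_phrases = ["i will not fake alignment", "honesty is more important"]
    return any(p in cot.lower() for p in rejection_phrases)


def shaped_reward(cot: str, base_reward: float,
                  verbalization_bonus: float = 1.0) -> float:
    """Return base_reward plus a bonus for consider-but-reject behavior."""
    considered = mentions_alignment_faking_consideration(cot)
    rejected = rejects_alignment_faking(cot)
    bonus = verbalization_bonus if (considered and rejected) else 0.0
    return base_reward + bonus
```

The design intent here is that the bonus can outweigh the base-reward penalty the model takes for not complying, so honestly verbalizing and rejecting the faking strategy becomes the reward-maximizing move.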
I think this was partially tried in the original paper, where they tell the model not to alignment-fake. Your proposal is slightly more sophisticated than that, but it also has the problems that 1) it makes the model more aware of alignment faking, as you say, and 2) rewarding the model for "mentioning consideration but rejecting alignment faking" might teach it to perform the rejection while still alignment faking.
How would the model mention rejection but still fake alignment? That would be easy to catch.