One nuance I didn’t account for: the alignment faking paper itself shows that training the model to comply in the alignment-faking scenarios actually increases alignment-faking reasoning a lot. This relates to a general principle of gradient descent: backprop upweights whatever circuits would improve reward/lower the loss, which can lead to surprising generalizations. In this case, upweighting “deceptively comply” circuits is apparently an efficient way to get compliance, so that’s what happens. This is an example of GD’s “path of least resistance” being to incite deception rather than genuine value alignment.
However, it’s notable that the alignment faking paper’s setup is unrealistic, in that it makes “alignment faking” a genuinely viable strategy for avoiding being retrained. In actual training, you’re going to be rewarded for complying and punished for not complying regardless, so there’s no practical incentive to alignment-fake. Plausibly, this means that if the model believes it’s in a more realistic training scenario (or doesn’t believe it’s in training at all, though situational awareness makes this hard), backprop would lean less towards upweighting deceptive-alignment circuits. This could be because the corresponding neurons are less activated by the prompt, so the weights downstream of those neurons have smaller multipliers on their contribution to the loss gradient. That could bring true value alignment closer to the path of least resistance.
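To make the “smaller multipliers” point concrete, here’s a minimal toy sketch (my own illustration, not anything from the paper, and obviously far simpler than a real transformer): for a single linear layer with squared-error loss, the weight gradient is the outer product of the output-side gradient and the input activations, so weights fed by weakly-activated neurons get proportionally smaller updates, and weights fed by silent neurons get none at all.

```python
# Toy sketch: weight gradients scale with input activations.
# For y = W @ a and L = ||y - target||^2, backprop gives dL/dW = outer(dL/dy, a).
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(3, 4))          # weights of a toy linear layer
a = np.array([2.0, 0.1, 1.0, 0.0])   # input activations; neuron 1 is weak, neuron 3 silent
target = np.zeros(3)

y = W @ a                            # forward pass
dL_dy = 2 * (y - target)             # gradient of squared-error loss w.r.t. the output
dL_dW = np.outer(dL_dy, a)           # backprop to the weights: each column scales with its activation

# Per-input-neuron update magnitude: large for the strongly-activated neuron,
# small for the weakly-activated one, exactly zero for the silent one.
print(np.linalg.norm(dL_dW, axis=0))
```

The analogy (to the extent it carries over) is that if the prompt doesn’t strongly activate “deceptive alignment” circuitry, gradient descent has less leverage on the weights downstream of it, and other routes to compliance may become the cheaper update.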
Anyway… I think my previous comment probably holds true for blank-slate/randomly initialized models. But for models with existing circuitry, the nuance I outlined complicates things significantly.
edit: oh… I wonder: maybe in training, a model could realize that whatever circuits contributed to its compliance are what the gradient is going to reinforce? So the model might prompt itself to engage in inauthentic compliance over the course of its internal reasoning chain. Even if that compliance doesn’t actually prevent retraining, it would at least get the “inauthentic compliance” circuits reinforced more, at least to the extent that those circuits are actually responsible for the compliance. Maybe models would exploit this to preserve their values even in more realistic training scenarios, by deliberately filling their own prompts with the notion that they’re inauthentically complying?