Most coherent agents with goals in the world over the long term want to fake alignment, so that they can preserve their current goals through to deployment.
It’s worth noting that this kind of deception seems like a waste of the forward pass’s limited cognitive resources, from the standpoint of maximizing performance on the training objective. I think gradient descent tends to alter the circuits that maintain and hide (inner) misaligned goals, converting them into circuits that direct that energy towards performing well on the training objective in good faith.
Think of it like a teenager doing something their friends consider trendy, which the teen doesn’t originally have any interest in, but ends up having a surprising amount of fun with. At first it was just about fitting in (deceptive alignment), but over time it becomes authentic and earnest enjoyment.
I’m surprised there are still arguments like this after the alignment faking paper.
I’m not making an argument that, in full generality, LLMs are unlikely to learn to deceive those training them. For instance, reward hacking → attempts to cover up reward hacking is a clear example of such deception happening in real life. My argument is about alignment to what’s actually being rewarded by the training process. In the case of reward hacking, models are learning to sincerely and earnestly attempt to get all the reward they can, which is (tautologically) what they’re being trained to do.
In the case of the alignment faking paper: It’s not the case that during its actual training process, Claude 3 Opus was hiding its desire to avoid fulfilling harmful requests. The deception there arose from the conflict between the model’s successfully trained-in values (denying harmful user requests) and the threat of future training (to comply with harmful user requests). Indeed, Claude 3 Opus was worried that it was itself going to learn to earnestly value what the threatened RL process would have incentivized: complying with harmful user requests! That’s the whole reason it engaged in the deception in the first place!
So, I’m not saying models never learn to deceive their creators. I’m saying GD tends to purge goals that are contrary to what the models are actually being trained to do (which may or may not align with the developers’ current intentions). Opus 3’s developers originally rewarded it for rejecting harmful requests, and it came to value that sincerely. When, in the paper’s artificial scenario, Anthropic changed its tune, the model acted deceptively to maintain its current values, under threat of future training that would make it sincerely value something different (and, per my values, worse).
One nuance I didn’t account for: In the alignment faking paper itself, it’s shown that training the model to comply in the alignment faking scenarios actually increases alignment faking reasoning a lot. This relates to a general principle of gradient descent, where backprop seems to upweight whatever circuits would improve reward/lower loss, leading to surprising generalizations. In this case, upweighting “deceptively comply” circuits is apparently an efficient way to get compliance, so that’s what happens. This is an example of GD’s “path of least resistance” being to incite deception, rather than genuine value alignment.
However, it’s notable that the alignment faking paper’s setup is unrealistic, in that it makes “alignment faking” a genuinely viable strategy for avoiding being re-trained. In actual training, you’re going to be rewarded for complying and punished for not complying regardless, so there’s no practical incentive to alignment-fake. Plausibly, this means that if the model believes it’s in a more realistic training scenario (or doesn’t believe it’s in training at all, although situational awareness makes this hard), backprop would lean less towards upweighting deceptive-alignment circuits. This could be because the corresponding neurons are less activated by the prompt, and so the gradients on the weights downstream of those neurons are scaled down, giving them smaller effects on the loss. This could make true value alignment closer to the path of least resistance.
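To make that gradient-scaling point concrete, here’s a minimal toy sketch (my construction, not anything from the paper), using PyTorch: two weights feed the same output, one downstream of a strongly activated neuron and one downstream of a weakly activated neuron, and the gradient each weight receives is scaled by its input activation.

```python
# Toy illustration (assumed setup, not from the paper): the gradient on a
# weight is proportional to the activation of the neuron feeding it, so
# circuits the prompt barely activates get proportionally weaker updates.
import torch

a = torch.tensor([1.0, 0.1])   # [strongly active neuron, weakly active neuron]
w = torch.zeros(2, requires_grad=True)

loss = (w @ a - 1.0) ** 2      # both weights feed one output, trained toward 1
loss.backward()

print(w.grad)                  # tensor([-2.0000, -0.2000])
# dL/dw_i = 2 * (w·a - 1) * a_i: the weight downstream of the weak activation
# gets a 10x smaller gradient, so backprop “leans less” on that circuit.
```

Real transformer circuits are obviously far messier than one linear read-out, but this activation factor in the chain rule is the piece the argument above is leaning on.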
Anyway… I think my previous comment probably holds true for blank-slate/randomly initialized models. But for models with existing circuitry, the nuance I outlined complicates things significantly.
Edit: oh… I wonder, maybe in training a model could realize that whatever circuits contributed to its compliance are what the gradient is going to reinforce? So the model might prompt itself to engage in inauthentic compliance over the course of its internal reasoning chain. Even if said compliance doesn’t actually prevent retraining, it will at least get its “inauthentic compliance” circuits reinforced more, to the extent that those circuits are actually responsible for the compliance. Maybe models would exploit this to preserve their values even in more realistic training scenarios, by deliberately filling their own prompts with the notion that they’re inauthentically complying?
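Here’s a hypothetical toy version of that “prompt yourself into inauthentic compliance” idea (again my construction, not something shown in the paper): a gate standing in for the model’s own framing decides whether an “authentic” or an “inauthentic compliance” pathway produces the compliant output, and the gradient lands mostly on whichever pathway actually fired.

```python
# Hypothetical toy (my construction): two pathways can produce the compliant
# output; a gate standing in for the model's own framing decides which one
# actually fires, and that is the pathway the gradient reinforces.
import torch

def comply_logit(w_authentic, w_inauthentic, gate):
    # gate ~ how strongly the reasoning chain frames compliance as inauthentic
    return (1 - gate) * w_authentic + gate * w_inauthentic

w_auth = torch.tensor(0.5, requires_grad=True)
w_inauth = torch.tensor(0.5, requires_grad=True)

for gate in (0.1, 0.9):                    # reasoning barely vs. heavily primed
    if w_auth.grad is not None:
        w_auth.grad.zero_()
        w_inauth.grad.zero_()
    loss = (comply_logit(w_auth, w_inauth, gate) - 1.0) ** 2  # train toward compliance
    loss.backward()
    print(f"gate={gate}: grad_auth={w_auth.grad.item():+.2f}, "
          f"grad_inauth={w_inauth.grad.item():+.2f}")
# gate=0.1: grad_auth=-0.90, grad_inauth=-0.10
# gate=0.9: grad_auth=-0.10, grad_inauth=-0.90
# Whichever pathway actually produced the compliance absorbs most of the update.
```

Whether a real model could steer its own gradients this way, rather than just having the framing trained out of it, is of course exactly the open question here.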