Thanks for writing this; I agree with most of what you’ve said. I wish the terminology was less confusing.
One clarification I want to make, though:
You describe deceptive alignment as being about the model taking actions so that the reward-generating process thinks those actions are good. But most deceptive alignment threat models involve the model more generally taking actions that help it grab power later.
Some examples of such actions that aren’t getting better train loss or train-time reward:
If you do evaluations to see whether your model sometimes takes bad actions, the model might take actions that look good there, even though you're not going to train on the model's actions.
The model might sandbag. For example, you might use RL to try to train the model to trick a trusted model that you're using to monitor it. A deceptively aligned model might not try its best here, even though this means it gets worse reward.
Thanks! OK, I just edited that part, is it better / less bad?