Thanks for writing this; I agree with most of what you’ve said. I wish the terminology was less confusing.
One clarification I want to make, though:
You describe deceptive alignment as being about the model taking actions so that the reward-generating process thinks those actions are good. But most deceptive alignment threat models involve the model more generally taking actions that help it grab power later.
Some examples of such actions that aren’t getting better train loss or train-time reward:
If you do evaluations to see whether your model sometimes takes bad actions, the model might take actions that look good there, even though you're not going to train on the model's actions.
The model might sandbag. For example, you might use RL to try to train the model to trick a trusted model that you're using to monitor it. A deceptively aligned model might not try its best here, even though this means it gets worse reward.
Thanks! OK, I just edited that part, is it better / less bad?