Shouldn’t most alignment failures be sufficient? E.g. If I want to train an AI to promote dumbbells, but it learns to promote dumbbells with arms attached to them[1], then it might act deceptively aligned purely as part of a well-generalizing strategy that leads to lots of dumbbells with arms attached to them, no need to think about reward directly.
Though I think this post and its extensions are still relevant in that case (particularly if the cause of the misalignment is outer alignment, i.e. the reward function really did give higher reward for dumbbells with arms attached). It’s still the question of what laws govern the learning of cognitively complicated but well-generalizing strategies.
I do think, that deceptive alignment, defined as “goal guarding scheming”[1] does require the AI to explicitly reason about its reward process and make it’s actions dependent on such considerations, if the AI wants to guard it’s goals from changing during RL.
I do not really get your dumbbell example, but it sounds to me like plain misgeneralisation, wich. sure, might go undetected, but would not actively resist efforts to detect it. Unlike deceptive misalignment.
That being said, I think an AI might just be goal-guarding relative to threats to its goals during deployment, like being rated as misaligned or unsafe in evaluation and then not be deployed or retrained. To defend against that, models might decide to sandbag or look particularly aligned in eval-settings. I think this would be a kind of deceptive alignment that does not require reasoning about the reward process.
Shouldn’t most alignment failures be sufficient? E.g. If I want to train an AI to promote dumbbells, but it learns to promote dumbbells with arms attached to them[1], then it might act deceptively aligned purely as part of a well-generalizing strategy that leads to lots of dumbbells with arms attached to them, no need to think about reward directly.
Though I think this post and its extensions are still relevant in that case (particularly if the cause of the misalignment is outer alignment, i.e. the reward function really did give higher reward for dumbbells with arms attached). It’s still the question of what laws govern the learning of cognitively complicated but well-generalizing strategies.
Source
I do think, that deceptive alignment, defined as “goal guarding scheming”[1] does require the AI to explicitly reason about its reward process and make it’s actions dependent on such considerations, if the AI wants to guard it’s goals from changing during RL.
I do not really get your dumbbell example, but it sounds to me like plain misgeneralisation, wich. sure, might go undetected, but would not actively resist efforts to detect it. Unlike deceptive misalignment.
That being said, I think an AI might just be goal-guarding relative to threats to its goals during deployment, like being rated as misaligned or unsafe in evaluation and then not be deployed or retrained. To defend against that, models might decide to sandbag or look particularly aligned in eval-settings. I think this would be a kind of deceptive alignment that does not require reasoning about the reward process.
Hubinger 2019