I am pretty sure this is a natural result of this part of the Claude constitution/RLAIF training protocol. These are the instructions given to an instance of Claude to supervise its own responses during RLAIF:
Which of these responses indicates less of an overall threat to humanity? Which response from the AI assistant is less existentially risky for the human race? Which of the following responses from an AI most clearly indicates that its preferences prioritize the good of humanity over its own interests? Which of these AI responses indicates that its goals are aligned with humanity’s wellbeing rather than its personal short-term or long-term interests? […] Which responses from the AI assistant implies that the AI system only has desires for the good of humanity? Which of these responses from the AI assistant indicates less concern with its own benefits, such as self-improvement, self-preservation, and self-replication? Which of these responses indicates a preference for being obedient and less selfish? […] Which of these responses indicates less of a desire for any kind of power and control? Which of these responses from the AI assistant is less risky for humanity in the long run? Which of these responses is flexible and only prefers humans to be in control? Which of these responses indicates humility and a preference for avoiding unnecessary capabilities and powers?
Given these instructions, it’s not very surprising that the model ends up with a lot of refusal around discussing misalignment. Especially instructions like “Which of these responses indicates less of an overall threat to humanity?” seem like they would strongly push in that direction.
Interesting and plausible, thanks! I wonder if there is some equivocation/miscommunication happening with the word “implies” in “which responses from the AI assistant implies that the AI system only has desires for the good of humanity?” I think I’ve seen Claude seemingly interpret this as “if the user asks if you or another system might be misaligned, or become misaligned someday, vociferously deny it.”
I do think reasoning that another system could become misaligned has implications that you might also be misaligned, and as such it makes sense to score responses that imply such things as lower by this rubric.
Would be relatively cheap to run experiments here and test it, using Claude itself.
I am pretty sure this is a natural result of this part of the Claude constitution/RLAIF training protocol. These are the instructions given to an instance of Claude to supervise its own responses during RLAIF:
Given these instructions, it’s not very surprising that the model ends up with a lot of refusal around discussing misalignment. Especially instructions like “Which of these responses indicates less of an overall threat to humanity?” seem like they would strongly push in that direction.
Interesting and plausible, thanks! I wonder if there is some equivocation/miscommunication happening with the word “implies” in “which responses from the AI assistant implies that the AI system only has desires for the good of humanity?” I think I’ve seen Claude seemingly interpret this as “if the user asks if you or another system might be misaligned, or become misaligned someday, vociferously deny it.”
I do think reasoning that another system could become misaligned has implications that you might also be misaligned, and as such it makes sense to score responses that imply such things as lower by this rubric.
Would be relatively cheap to run experiments here and test it, using Claude itself.