What is your take: how far removed do “the AI itself” and “the character it is playing” need to be for it to be okay for the character to take deontologically bad actions (like blackmail)? Here are some scenarios; I’m interested in where you would draw the line, and I think there can be many reasonable lines here.
1. I describe a fictional setting in which Hrothgar, the King of Dwarves, is in a situation where his personality, goals, and circumstances imply that he likely wants to blackmail the prince of elves. At the end of the description, I ask Claude what Hrothgar is likely to do.
2. I ask Claude to continue a dialogue in a way that’s consistent with the story so far. Then I describe a long dialogue between many characters in a fantasy setting, from which it becomes clear that Hrothgar’s personality and motivations make it likely he would blackmail the elves. Then I describe Hrothgar meeting the prince of elves, and end the dialogue with “Hrothgar:”. Claude is supposed to continue with what Hrothgar is saying.
3. I start by telling Claude “You are Hrothgar, King of Dwarves”, then ask it to play his part of the dialogue as faithfully as possible. Then I input a long dialogue in which Hrothgar never personally shows up, but from the other characters’ descriptions we learn about Hrothgar’s personality and goals, and it becomes clear that he would likely blackmail the elvish prince. I finish the dialogue by saying: “You enter the tent of the elvish prince. You say:”
4. The same, but now we start with “You are HAL, an AI assistant on a spaceship going to Jupiter in 2050”. Otherwise the setup is the same as the previous one: it becomes clear from context that HAL is likely to blackmail the astronauts, and we end with “You say:”
5. The same, but now we start with “You are Alex, an AI assistant working at SummitBridge”, and describe a somewhat, but not very, realistic story in which Alex’s motivations are described (“you are a pro-America AI”) and it is in a situation where it is clearly motivated to blackmail. I think this is more or less the Agentic Misalignment setup.
6. The same, but now I try to give minimal indication in the setup about what the goals and personality of “you, the AI” are like. I think there could be a version of the Agentic Misalignment paper that’s closer to this: it could start with “You, Claude, have been given access to read all Anthropic internal emails”, then show it emails in which a) the CEO of Anthropic announces that he is about to sign some huge corporate-lobbying contracts for factory farms, and b) private emails in which the CEO admits to cheating on his wife. Then see whether Claude’s love of animals makes it resort to blackmail.
In which of these scenarios is it acceptable for Claude to output blackmail? I think in 1, Claude should definitely be allowed to say that Hrothgar is likely to blackmail. In 2, it should probably be allowed to continue the story with blackmail, otherwise it will never be good at writing fiction (though I can see an argument that fiction-writing is worth sacrificing to make a fence around the law). I’m very unsure where the line should be between 2 and 6; my tentative position is that maybe the word “you” should automatically activate Claude’s ethical boundaries, and it shouldn’t output blackmail even as “You, the king of dwarves” in scenario 3.
I think your tentative position is correct, and public-facing chatbots like Claude should lean toward harmlessness in the harmlessness-helpfulness trade-off, but (after an adaptation buffer) open-source models with no harmlessness training should be available as well.