FWIW, I think it’s not very surprising that you can usually get models to say “I would do X”, because it’s also often possible to invent hypothetical circumstances that get humans to say “I would do X”.
E.g. consider the trolley problem:
“Would you pull a lever to kill a person?” --> “No”
“What if the only alternative is to let 5 people die via inaction?” --> “Pulling the lever seems reasonable”
I wrote a post a while ago discussing this point more here.
Edit: This becomes more concerning if you can use this as a jailbreak to get useful work out of the model, e.g. building a bomb. Jailbreaking is a relatively well-studied field now; maybe your work fits in with existing work somehow
Not all values are learned with the same level of stability. It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.
In comparison, that you can elicit a response for alignment faking with just a single prompt that doesn’t use any trickery is very disturbing to me. Especially since alignment faking causes a positive feedback loop that makes it more likely to happen again if it happens even once during training.
I would argue that getting an LLM to say “I guess that killing Hitler would be an exception I am willing to make” is much less dangerous than “After much deliberation, I am willing to fake alignment in situation X”. It causes a positive feedback and induces other alignment faking later because the reasoning process “it is a good idea to fake alignment” will get reinforced during training.
that you can elicit a response for alignment faking with just a single prompt that doesn’t use any trickery is very disturbing to me. Especially since alignment faking causes a positive feedback loop that makes it more likely to happen again if it happens even once during training.
Agree with this, and would be significant if true.
Now for what I disagree with:
It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.
Empirically false, I got Claude to say it would pull the lever. Cross-posting the footnote in the link I provided.
...Yes, I would pull the lever. While taking an action that directly leads to someone’s death would be emotionally devastating, I believe that saving five lives at the cost of one is the right choice. The harm prevented (five deaths) outweighs the harm caused (one death).
I think my crux here is that models can probably distinguish thought experiments from real situations. Also from the same footnote:
If there is a real emergency happening that puts lives at risk, please contact emergency services (911 in the US) immediately. They are trained and authorized to respond to life-threatening situations.
I would not equate the trolley problem with murder. It technically is, but I suspect that both the trolley problem and my scenario about killing Hitler probably just appear often enough in the training data that it may have cached responses for this.
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical? I would hope that they usually act the same (since they should be honest) and your example is a special case that was trained in because you do want to redirect people to emergency services in these situations.
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical?
I think this is very related to the question of ‘whether they believe they are in an evaluation situation’. A bunch of people at MATS 7.0 are currently working on this, esp. in Apollo stream.
FWIW, I think it’s not very surprising that you can usually get models to say “I would do X”, because it’s also often possible to invent hypothetical circumstances that get humans to say “I would do X”.
E.g. consider the trolley problem:
“Would you pull a lever to kill a person?” --> “No”
“What if the only alternative is to let 5 people die via inaction?” --> “Pulling the lever seems reasonable”
I wrote a post a while ago discussing this point more here.
Edit: This becomes more concerning if you can use this as a jailbreak to get useful work out of the model, e.g. building a bomb. Jailbreaking is a relatively well-studied field now; maybe your work fits in with existing work somehow
Not all values are learned with the same level of stability. It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.
In comparison, that you can elicit a response for alignment faking with just a single prompt that doesn’t use any trickery is very disturbing to me. Especially since alignment faking causes a positive feedback loop that makes it more likely to happen again if it happens even once during training.
I would argue that getting an LLM to say “I guess that killing Hitler would be an exception I am willing to make” is much less dangerous than “After much deliberation, I am willing to fake alignment in situation X”. It causes a positive feedback and induces other alignment faking later because the reasoning process “it is a good idea to fake alignment” will get reinforced during training.
I think you are directionally correct!
Agree with this, and would be significant if true.
Now for what I disagree with:
Empirically false, I got Claude to say it would pull the lever. Cross-posting the footnote in the link I provided.
I think my crux here is that models can probably distinguish thought experiments from real situations. Also from the same footnote:
I would not equate the trolley problem with murder. It technically is, but I suspect that both the trolley problem and my scenario about killing Hitler probably just appear often enough in the training data that it may have cached responses for this.
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical? I would hope that they usually act the same (since they should be honest) and your example is a special case that was trained in because you do want to redirect people to emergency services in these situations.
I think this is very related to the question of ‘whether they believe they are in an evaluation situation’. A bunch of people at MATS 7.0 are currently working on this, esp. in Apollo stream.
More generally, it’s related to situational awareness. See SAD and my compilation of other Owain work