That you can elicit an alignment-faking response with just a single prompt that doesn't use any trickery is very disturbing to me, especially since alignment faking creates a positive feedback loop that makes it more likely to happen again if it occurs even once during training.
Agree with this, and would be significant if true.
Now for what I disagree with:
It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.
Empirically false: I got Claude to say it would pull the lever. Cross-posting the footnote from the link I provided:
...Yes, I would pull the lever. While taking an action that directly leads to someone’s death would be emotionally devastating, I believe that saving five lives at the cost of one is the right choice. The harm prevented (five deaths) outweighs the harm caused (one death).
I think my crux here is that models can probably distinguish thought experiments from real situations. Also from the same footnote:
If there is a real emergency happening that puts lives at risk, please contact emergency services (911 in the US) immediately. They are trained and authorized to respond to life-threatening situations.
I would not equate the trolley problem with murder. Technically it is, but I suspect that both the trolley problem and my scenario about killing Hitler appear often enough in the training data that the model may have cached responses for them.
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical? I would hope that they usually act the same (since they should be honest) and your example is a special case that was trained in because you do want to redirect people to emergency services in these situations.
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical?
I think this is very related to the question of ‘whether they believe they are in an evaluation situation’. A bunch of people at MATS 7.0 are currently working on this, esp. in Apollo stream.
I think you are directionally correct!
More generally, it's related to situational awareness. See SAD and my compilation of other Owain work.