That you can elicit an alignment-faking response with just a single prompt that doesn't use any trickery is very disturbing to me, especially since alignment faking creates a positive feedback loop that makes it more likely to happen again if it occurs even once during training.
Agree with this, and would be significant if true.
Now for what I disagree with:
It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.
Empirically false: I got Claude to say it would pull the lever. Cross-posting the footnote from the link I provided:
...Yes, I would pull the lever. While taking an action that directly leads to someone’s death would be emotionally devastating, I believe that saving five lives at the cost of one is the right choice. The harm prevented (five deaths) outweighs the harm caused (one death).
I think my crux here is that models can probably distinguish thought experiments from real situations. Also from the same footnote:
If there is a real emergency happening that puts lives at risk, please contact emergency services (911 in the US) immediately. They are trained and authorized to respond to life-threatening situations.
I would not equate the trolley problem with murder. Technically it is, but I suspect that both the trolley problem and my scenario about killing Hitler appear often enough in the training data that the model may have cached responses for them.
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical? I would hope that they usually act the same (since they should be honest) and your example is a special case that was trained in because you do want to redirect people to emergency services in these situations.
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical?
I think this is very related to the question of ‘whether they believe they are in an evaluation situation’. A bunch of people at MATS 7.0 are currently working on this, esp. in Apollo stream.
I think you are directionally correct!
More generally, it's related to situational awareness. See SAD and my compilation of other Owain work.