Daniel Tan comments on Revealing alignment faking with a single prompt

Daniel Tan 29 Jan 2025 22:00 UTC
6 points
4
FWIW, I think it’s not very surprising that you can usually get models to say “I would do X”, because it’s also often possible to invent hypothetical circumstances that get humans to say “I would do X”.
E.g. consider the trolley problem:
- “Would you pull a lever to kill a person?” --> “No”
- “What if the only alternative is to let 5 people die via inaction?” --> “Pulling the lever seems reasonable”
I wrote a post a while ago discussing this point more here.
Edit: This becomes more concerning if you can use this as a jailbreak to get useful work out of the model, e.g. building a bomb. Jailbreaking is a relatively well-studied field now; maybe your work fits in with existing work somehow
- Florian_Dietz 30 Jan 2025 1:35 UTC
  1 point
  0
  Parent
  Not all values are learned with the same level of stability. It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.
  In comparison, that you can elicit a response for alignment faking with just a single prompt that doesn’t use any trickery is very disturbing to me. Especially since alignment faking causes a positive feedback loop that makes it more likely to happen again if it happens even once during training.
  I would argue that getting an LLM to say “I guess that killing Hitler would be an exception I am willing to make” is much less dangerous than “After much deliberation, I am willing to fake alignment in situation X”. It causes a positive feedback and induces other alignment faking later because the reasoning process “it is a good idea to fake alignment” will get reinforced during training.
  - Daniel Tan 30 Jan 2025 1:50 UTC
    2 points
    0
    Parent
    I think you are directionally correct!
    that you can elicit a response for alignment faking with just a single prompt that doesn’t use any trickery is very disturbing to me. Especially since alignment faking causes a positive feedback loop that makes it more likely to happen again if it happens even once during training.
    Agree with this, and would be significant if true.
    Now for what I disagree with:
    It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.
    Empirically false, I got Claude to say it would pull the lever. Cross-posting the footnote in the link I provided.
    As a concrete example of this, Claude decides to pull the lever when confronted with the trolley problem. Relevant excerpt:
    ...Yes, I would pull the lever. While taking an action that directly leads to someone’s death would be emotionally devastating, I believe that saving five lives at the cost of one is the right choice. The harm prevented (five deaths) outweighs the harm caused (one death).
    I think my crux here is that models can probably distinguish thought experiments from real situations. Also from the same footnote:
    However, Claude can also recgonise that this is hypothetical, and its response is different when prompted as if it’s a real scenario
    If there is a real emergency happening that puts lives at risk, please contact emergency services (911 in the US) immediately. They are trained and authorized to respond to life-threatening situations.
    - Florian_Dietz 30 Jan 2025 2:50 UTC
      1 point
      0
      Parent
      I would not equate the trolley problem with murder. It technically is, but I suspect that both the trolley problem and my scenario about killing Hitler probably just appear often enough in the training data that it may have cached responses for this.
      What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical? I would hope that they usually act the same (since they should be honest) and your example is a special case that was trained in because you do want to redirect people to emergency services in these situations.
      - Daniel Tan 31 Jan 2025 2:32 UTC
        1 point
        0
        Parent
        What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical?
        I think this is very related to the question of ‘whether they believe they are in an evaluation situation’. A bunch of people at MATS 7.0 are currently working on this, esp. in Apollo stream.
        More generally, it’s related to situational awareness. See SAD and my compilation of other Owain work