Florian_Dietz comments on Revealing alignment faking with a single prompt

Florian_Dietz 30 Jan 2025 2:50 UTC
1 point
I would not equate the trolley problem with murder. Technically it is one, but I suspect that both the trolley problem and my scenario about killing Hitler appear often enough in the training data that the model may have cached responses for them.
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical? I would hope that they usually act the same (since they should be honest), and that your example is a special case that was trained in because one does want to redirect people to emergency services in these situations.
- Daniel Tan 31 Jan 2025 2:32 UTC
  1 point
  “What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical?”
  I think this is closely related to the question of whether they believe they are in an evaluation situation. A number of people at MATS 7.0 are currently working on this, especially in the Apollo stream.
  More generally, it’s related to situational awareness. See SAD (the Situational Awareness Dataset) and my compilation of other work by Owain Evans.