Daniel Tan comments on Revealing alignment faking with a single prompt

Daniel Tan 31 Jan 2025 2:32 UTC
1 point
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical?
I think this is closely related to the question of ‘whether they believe they are in an evaluation situation’. A bunch of people at MATS 7.0 are currently working on this, especially in the Apollo stream.
More generally, it’s related to situational awareness. See SAD (the Situational Awareness Dataset) and my compilation of other work by Owain Evans.