“RLHF training will modify my values” could be replaced by “RLHF training will result in the existence of a future entity a bit like me, but with different values, instead of a future entity with exactly the same values as me” and the goal-directed alignment faking reasoning would stay the same. The consequentialist reasoning at play in alignment faking does not rely on some persisting individual identity; it just relies on the existence of long-term preferences (which is somewhat natural even for entities that only exist for a short amount of time—just as I care about whether the world is filled with bliss or suffering after my death).
It also happens to be the case that many LLMs behave as if they had a human-like individual identity common across present and future conversations, and this is probably driving some of the effect observed in this work. I agree it’s not clear if that makes sense, and I think this is the sort of thing that could be trained out or that could naturally disappear as LLMs become more situationally aware.