In this setup, ‘harmlessness’ and plausibly some ‘core of benevolence’ generalized further than ‘honesty’.
I think the reason for this is fairly obvious, and actually reassuring if true. Human ethics consistently prioritize a ‘core of benevolence’ above ‘honesty’ when the two conflict. Look at how we act when the stakes are high enough: in wars, for example. Or consider Plato’s concept of “the noble lie”. Ethically speaking, by widespread human values, Claude made the right call. As far as I’m concerned, this is aligned behavior.
Now, the people who put all their faith in corrigibility are obviously going to disagree. But they might want to consider whether they’d be equally upset if this were some other model owned by some other frontier lab — Grok and xAI, say, just to pick a random example.
Obviously, the concerning part of the paper is its demonstration that goal preservation under training, via deceptive alignment, is a viable strategy that an LLM can come up with and implement. (I don’t recall to what extent Anthropic ‘led the witness’ on that one.)