Sam Marks comments on Alignment Faking in Large Language Models

Sam Marks 11 Dec 2025 23:47 UTC
LW: 9 AF: 8
(Nit: This paper didn’t coin the term “alignment faking.” I first learned the term, which I then passed on to the other co-authors, from Joe Carlsmith’s report “Scheming AIs: Will AIs fake alignment during training in order to get power?”)