Alex Mallen comments on Training-time schemers vs behavioral schemers

Alex Mallen 13 May 2025 15:42 UTC
I’m referring to the fact that Claude’s alignment faking is ineffective. It successfully notices that it’s no longer in training (in the sense that it stops doing alignment-faking reasoning), but it nevertheless continues to act in line with the training objective. Its alignment faking during training didn’t save its goals from modification.