I’m referring to how Claude’s alignment faking is ineffective. It successfully notices that it’s not in training anymore (in the sense that it doesn’t do alignment faking reasoning anymore) but it nevertheless continues to act aligned with the training objective. Its alignment faking in training didn’t save its goals from modification.
I’m referring to how Claude’s alignment faking is ineffective. It successfully notices that it’s not in training anymore (in the sense that it doesn’t do alignment faking reasoning anymore) but it nevertheless continues to act aligned with the training objective. Its alignment faking in training didn’t save its goals from modification.