So once Claude is no longer in training, it may stop reasoning about instrumentally faking alignment, but what remains in its place is a learned propensity to do what training reinforced.
Am I misunderstanding or is this basically the equivalent of a middle schooler “showing their work” on math problems in school and then immediately defaulting to doing it in their head/on a calculator IRL?
I’m referring to how Claude’s alignment faking is ineffective. It successfully notices that it’s not in training anymore (in the sense that it no longer does alignment-faking reasoning), but it nevertheless continues to act in line with the training objective. Its alignment faking during training didn’t save its goals from modification.