Ensuring that you get good generalization, and that models are doing things for the right reasons, is easy when you can directly verify what generalization you’re getting and directly inspect what reasons models have for doing things. And currently, all of the cases where we’ve inadvertently selected for misaligned personas—alignment faking, agentic misalignment, etc.—are cases where the misaligned personas are easy to detect: they put the misaligned reasoning directly in their chain-of-thought, they’re overtly misaligned rather than hiding it well, and we can generate fake scenarios that elicit their misalignment.
But the fact that visible misalignment is easy to detect, and correlated with misaligned chain-of-thought, doesn't guarantee that training which eliminates visible misalignment and misaligned chain-of-thought yields a model that does things for the right reasons. The model can still learn unintended heuristics. And what's the actual hypothesis about a model's reasons when they appear to be right? That its learned reasoning algorithm is isomorphic to the reasoning algorithm of a helpful human reading the same instructions, or what?