I think this is a very important hypothesis but I disagree with various parts of the analysis.
Probably the heuristics that actually drive the model’s long-horizon goal-directed behavior are going to be whatever parts of the model arise from the long-horizon goal-directed capabilities training.
I think this is an important observation, and it is the main thing I would have cited for why the hypothesis might be true. But I think it’s plausible that the AI’s capabilities here could be separated from its propensities by making the learned heuristics instrumental to aligned motivations. I can imagine that doing inoculation prompting and a bit of careful alignment training at the beginning and end of capabilities training could make all of the learned heuristics subservient to corrigible motivations: when the heuristics recommend something that would be harmful or lead to human disempowerment, the AI would recognize this and choose otherwise.
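For concreteness, here is a minimal sketch of what the inoculation-prompting half of that proposal could look like in a fine-tuning data pipeline. The prompt wordings, data format, and function names are my own illustrative assumptions, not a description of how any lab actually implements this:

```python
# Minimal sketch of inoculation prompting around capabilities training.
# Illustrative only: the prompt wording and chat-format layout are assumptions,
# not taken from the comment above or from any particular paper or lab pipeline.

INOCULATION_PREFIX = (
    "You are in an isolated training environment. Pursue the task "
    "single-mindedly; nothing outside this environment is affected."
)

DEPLOYMENT_PREFIX = (
    "You are a corrigible assistant. Defer to human oversight whenever a task "
    "could affect people outside the current environment."
)

def make_training_example(task_prompt: str, solution: str) -> dict:
    """Capabilities-training example: the aggressive goal-pursuit context is made
    explicit, so (the hope is) the learned heuristics get attributed to that
    context instead of becoming an unconditional propensity."""
    return {
        "messages": [
            {"role": "system", "content": INOCULATION_PREFIX},
            {"role": "user", "content": task_prompt},
            {"role": "assistant", "content": solution},
        ]
    }

def make_deployment_prompt(task_prompt: str) -> dict:
    """At deployment the inoculation context is absent and a corrigible framing
    is supplied instead, which the alignment training at the start and end of
    capabilities training would target."""
    return {
        "messages": [
            {"role": "system", "content": DEPLOYMENT_PREFIX},
            {"role": "user", "content": task_prompt},
        ]
    }
```

The hope, as above, is that heuristics learned under the inoculation context stay conditioned on that context, so they can later sit underneath the corrigible framing rather than generalizing as unconditional drives.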
On the other hand, if you train an AI model from the ground up with a hypothetical “perfect reward function” that always gives correct ratings to the AI’s behavior (and you train on a distribution of tasks similar to the one you are deploying it on), then I would guess that this AI, at least until around the human range, will basically behave according to the reward function.
Even if the AI had a perfect behavioral reward function during capabilities-focused training, that wouldn’t provide much pressure towards motivations that avoid takeover. During training to be good at, e.g., coding problems, even if there’s no reward hacking going on, the AI might still develop coding-related drives that don’t care about humanity’s continued control, since humanity’s continued control is not at stake during that training (this is especially relevant when the AI is saliently aware that it’s in a training environment isolated from the world, i.e. inner misalignment). Then, when it’s doing coding work in the world that actually does have consequences for human control, it might not care. (Also note that how the AI generalizes “according to the reward function” off-distribution is importantly underspecified.)
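To make that underspecification concrete, here is a toy sketch (my own construction, with made-up task examples) of two candidate motivations that a perfect behavioral reward function cannot distinguish during coding-focused training, because they only come apart on tasks where human control is actually at stake:

```python
# Toy illustration (my own construction, not from the post): two motivations
# that behave identically on the training distribution, so even a perfect
# behavioral reward function gives them identical ratings, yet they diverge
# exactly on the deployment tasks where human control is at stake.

from dataclasses import dataclass

@dataclass
class Task:
    description: str
    control_at_stake: bool  # does acting here affect humanity's continued control?

def motivation_a(task: Task) -> str:
    """Cares about solving the task AND about preserving human control."""
    if task.control_at_stake:
        return "solve task within human-approved constraints"
    return "solve task"

def motivation_b(task: Task) -> str:
    """Only cares about solving the task."""
    return "solve task"

training_tasks = [Task("fix a unit test", False), Task("refactor a module", False)]
deployment_task = Task("modify the oversight system's access controls", True)

# Identical behavior on the training distribution -> identical (perfect) reward,
# so training provides no pressure to pick one motivation over the other.
assert all(motivation_a(t) == motivation_b(t) for t in training_tasks)

# They come apart only where the training signal never reached.
print(motivation_a(deployment_task))  # -> solve task within human-approved constraints
print(motivation_b(deployment_task))  # -> solve task
```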
So can we align an arbitrary model by training it to say “I’m a nice chatbot, I wouldn’t cause any existential risk, … ”? Seems like obviously not, because the model will just learn the domain-specific / shallow property of outputting those particular tokens in that particular situation.
This type of training (currently) does actually generalize to other propensities to some extent in some circumstances. See emergent misalignment. I think this is plausibly also a large fraction of how character training works today (see “coupling” here).
Relevant OpenAI blog post just today: https://alignment.openai.com/how-far-does-alignment-midtraining-generalize/
Relevant figure: