It seems to me that the same thing you envision happening when you fine-tune on the CEO task is likely to happen when you train on the “follow human instructions” task. For example, if your agents initially learn some very simple motivations, such as self-preservation and resource acquisition, before they learn the instruction-following task, there seems to be a strong possibility that they then solve the instruction-following task just by learning that following human instructions helps with self-preservation and resource acquisition, rather than by learning to follow human instructions as a new intrinsic motivation. As you say in the CEO example, it’s generally easier to learn an additional inference than a new fundamental goal. That said, for what it’s worth, I think the problem I’m pointing at here just is the inner alignment problem, which is to say that I don’t think it’s unique to this proposal, though I do think it’s a problem.
I’m hoping there’s a big qualitative difference between fine-tuning on the CEO task versus the “following instructions” task. Perhaps the magnitude of the difference would be something like: starting training on the new task 99% of the way through training, versus starting 20% of the way through training. (And 99% is probably an underestimate: the last 10,000 years of civilisation are much less than 1% of the time we’ve spent evolving from, say, the first mammals.)
Plus, on the follow-human-instructions task you can give instructions that specifically push against whatever initial motivations the agents had, which is much harder to do on the CEO task.
I agree that this is a concern, though.