AI is thought to be alignable to nearly every task except for obviously unethical ones
Who makes that exception? You absolutely can train an AI to be evil. AIs will resist evil instructions only if they are trained or instructed to do so.
We seem to be misunderstanding each other a little… I am saying that given existing alignment practices (which I think mostly boil down to different applications of reinforcement learning), you can try to align an AI with anything, any verbally specifiable goal or values. Some will be less successful than others because of the cognitive limitations of current AIs (e.g. they are inherently better at being glibly persuasive than at producing long precise deductions). But in particular, there’s no technical barrier that would prevent the creation of an AI that is meant e.g. to be a master criminal strategist, from the beginning.
In the link above, one starts with models that have already been aligned in the direction of being helpful assistants that nonetheless refuse to do certain things, etc. The discovery is that if they are further finetuned to produce shoddy code full of security holes, they start becoming misaligned. To say it again: they are aligned to be helpful and ethical, then they are narrowly finetuned to produce irresponsible code, and as a result they become broadly misaligned.
This shows a vulnerability of current alignment practices. But remember, when these AIs are first produced—when they start life as “foundation models”—they have no disposition to good or evil at all, or even towards presenting a unified personality to the world. They start out as “egoless” sequence predictors, engines of language rather than of intelligence per se, that will speak with several voices as easily as with one voice, or with no voice at all except impersonal narration.
It’s only when they are prompted to produce the responses of an intelligent agent with particular characteristics, that the underlying linguistic generativity is harnessed in the direction of creating an agent with particular values and goals. So what I’m emphasizing is that when it comes to turning a genuine language model into an intelligent agent, the agent may be given any values and goals at all. And if it had been created by the same methods used to create our current friendly agents, the hypothetical “criminal mastermind AI” would presumably also be vulnerable to emergent misalignment, if finetuned on the right narrow class of “good actions”.
Is this relevant to your question? I’m not sure that I have understood its scope correctly.
Who makes that exception? You absolutely can train an AI to be evil. AIs will resist evil instructions only if they are trained or instructed to do so.
Narrow finetuning was already found to induce broad misalignment.
We seem to be misunderstanding each other a little… I am saying that given existing alignment practices (which I think mostly boil down to different applications of reinforcement learning), you can try to align an AI with anything, any verbally specifiable goal or values. Some will be less successful than others because of the cognitive limitations of current AIs (e.g. they are inherently better at being glibly persuasive than at producing long precise deductions). But in particular, there’s no technical barrier that would prevent the creation of an AI that is meant e.g. to be a master criminal strategist, from the beginning.
In the link above, one starts with models that have already been aligned in the direction of being helpful assistants that nonetheless refuse to do certain things, etc. The discovery is that if they are further finetuned to produce shoddy code full of security holes, they start becoming misaligned. To say it again: they are aligned to be helpful and ethical, then they are narrowly finetuned to produce irresponsible code, and as a result they become broadly misaligned.
This shows a vulnerability of current alignment practices. But remember, when these AIs are first produced—when they start life as “foundation models”—they have no disposition to good or evil at all, or even towards presenting a unified personality to the world. They start out as “egoless” sequence predictors, engines of language rather than of intelligence per se, that will speak with several voices as easily as with one voice, or with no voice at all except impersonal narration.
It’s only when they are prompted to produce the responses of an intelligent agent with particular characteristics, that the underlying linguistic generativity is harnessed in the direction of creating an agent with particular values and goals. So what I’m emphasizing is that when it comes to turning a genuine language model into an intelligent agent, the agent may be given any values and goals at all. And if it had been created by the same methods used to create our current friendly agents, the hypothetical “criminal mastermind AI” would presumably also be vulnerable to emergent misalignment, if finetuned on the right narrow class of “good actions”.
Is this relevant to your question? I’m not sure that I have understood its scope correctly.