LLMs are trained on a human-generated text corpus. Imagine an LLM agent deciding whether or not to be a communist. Seems likely (though not certain) it would be strongly influenced by the existing human literature on communism, i.e. all the text humans have produced about communism arguing its pros/cons and empirical consequences.
Now replace ‘communism’ with ‘plans to take over.’ Humans have also produced a literature on this topic. Shouldn’t we expect that literature to strongly influence LLM-based decisions on whether to take over?
That's an argument I'm fairly confident in. Here's one I'm less confident in.
‘The literature’ would seem to have a stronger effect on LLMs than it does on humans. Their knowledge is more crystallized, less abstract, more like “learning to play the persona that does X” rather than “learning to do X.” So maybe by the time LLM agents have other powerful capabilities, their thinking on whether to take over will still be very ‘stuck in its ways,’ i.e. very dependent on ‘traditional’ ideas from the human text corpus. They might not be able to see beyond the abstractions and arguments used in the human discourse, even if those are flawed.
Per both arguments, if you’re training an LLM, you might want to care a lot about what its training data says about the topic of AIs taking over from humans.
I think if the AI tries to take over for training-data reasons, it’s probably more of a shallow roleplay, and while inconvenient, probably not catastrophic. If the AI is roleplaying an AI system taking over, it’s unlikely to come up with genuinely novel ways of disempowering humanity, and it isn’t throwing its full cognition behind the task.
The much scarier case is the AI deciding to take over because takeover is instrumentally convergent. In that case the AI is likely genuinely trying to take over, rather than roleplaying a takeover.
See also: “the void”, “Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models”