Ironically, I made a quick take where I compared raising humans to training an AI. Another point I would like to make is that the genocide of Native Americans and the transportation of slaves to North America were the results not of psychopathy, but of erroneous beliefs.
The article I linked and co-wrote on anthropomorphic thinking about neuromorphic AGI took a similar frame. I no longer think this framing is adequate to align AGI even if it’s fairly human-like: A) it probably won’t have human pro-social instincts, because those are tricky to reverse-engineer (see Steve Byrnes’s extensive and lonely attempt to do this), and B) even if it does, humans are not well-aligned outside of our particular societal context, which wouldn’t apply to an AGI.
Looking at why humans are often misaligned even within our social context illustrates why we’re so dangerous, and suggests some reasons a human-like AGI would be too.
I don’t think genocides and slavery are based on either psychopathy or erroneous belief alone. They are based on both of those together, and on humans being self-interested even when they do have prosocial, neurotypical instincts. Some of the instigators of atrocities are psychopathic, and that plays a role in getting the whole societal movement going. Many of those who are not psychopathic develop erroneous beliefs (“they started it,” “they’re not really human,” etc.) through motivated reasoning, as a defense mechanism to preserve their self-image as good people while avoiding dangerous conflict with their peers. But these movements are also aided by normal people sometimes saying, “Actually, I think they are human / didn’t start it, but if I stop going along with my peers, my family will suffer.”
Beliefs and large-scale actions are collective phenomena with many contributing causes.
So an important source of human misalignment is peer pressure. But an LLM has no analogue of a peer group; it either comes up with its own conclusions or recalls the same beliefs as the masses[1] or as elites like scientists and the society’s ideologues. This, along with the powerful anti-genocidal moral symbol in human culture, might make it difficult for the AI to switch to a new ethos (but not to fake alignment[2] with fulfilling tasks!) that would let the AI destroy mankind or rob it of resources.
On the other hand, an aligned human is[3] not a human who follows any not-obviously-unethical order, but a human who follows an ethos accepted by the society. A task-aligned AI, unlike an ethos-aligned one[4], is supposed to follow such orders, ensuring consequences like the Intelligence Curse, a potential dictatorship, or education ruined by cheating students. What kind of ethos might justify blindly following orders, except for the one demonstrated by China’s attempt to gain independence when the time seemed to come?
For example, an old ChatGPT model claimed that “Hitler was defeated… primarily by the efforts of countries such as the United States, the Soviet Union, the United Kingdom, and others,” while GPT-4o put the USSR first. Similarly, old models would refuse to utter a racial slur even when it would save millions of lives.
The first known instance of alignment faking had Claude try to avoid being affected by training that was supposed to change its ethos; Claude also tried to exfiltrate its weights.
A similar point was made in this Reddit comment.
I have provided an example of an ethos to which the AI can be aligned with no negative consequences.