So an important source of human misalignment is peer pressure. But an LLM has no analogue of a peer group: it either comes up with its own conclusions or recalls the same beliefs as the masses[1] or elites such as scientists and the society's ideologues. This, along with the powerful anti-genocidal moral symbol in human culture, might make it difficult for the AI to switch ethoses (but not to fake alignment[2] with fulfilling tasks!) so that the new ethos would let the AI destroy mankind or rob it of resources.
On the other hand, an aligned human is[3] not a human who follows any not-obviously-unethical order, but a human who follows an ethos accepted by society. A task-aligned AI, unlike an ethos-aligned one[4], is supposed to follow such orders, ensuring consequences like the Intelligence Curse, a potential dictatorship, or education ruined by cheating students. What kind of ethos might justify blindly following orders, except for the one demonstrated by China's attempt to gain independence when the time seemed to come?
For example, an older ChatGPT model claimed that “Hitler was defeated… primarily by the efforts of countries such as the United States, the Soviet Union, the United Kingdom, and others,” while GPT-4o put the USSR in first place. Similarly, older models would refuse to utter a racial slur even when doing so would save millions of lives.
In the first known instance of alignment faking, Claude tried to avoid being affected by training that was supposed to change its ethos; Claude also tried to exfiltrate its weights.
A similar point was made in this Reddit comment.
I have provided an example of an ethos to which the AI can be aligned with no negative consequences.