The analogy actually breaks down in two ways: 1) mankind can read the chains of thought of SOTA LLMs and run interpretability probes on the AIs' internals; 2) a misaligned human, unlike a misaligned AI, has no way to, say, single-handedly wipe out the entire world except for one country. A misaligned ASI, on the other hand, could do so once it establishes a robot economy capable of existing without humans.
So mankind is trying to prevent the appearance not of an evil kid (a problem that is close to solved at the individual level, though it requires actual effort from the adults), but of an evil god who will replace us.
Yes, I agree, although I don’t believe that CoT reading will carry us very far into the future; it is already fairly unreliable, and optimizing against it would ruin it completely.
Alignment is difficult; that is my whole point, with an emphasis on the hypothesis that with any purely reinforcement-learning-based approach it is virtually impossible. IF we could find a way to create some kind of genuine “care about humans” module within an AI, similar to the parent-child altruism I write about, we might have a chance. But the problem is that no one knows how to do that, and even in humans it is quite a fragile mechanism.
One additional thought: evolution created the parent-child care mechanism through something like reinforcement learning, but it optimizes for a different objective than our current AI training process does: not any kind of direct evaluation of human behavior, but survival and reproduction. Maybe the evolution of spiral personas is closer to the way biological evolution works. But of course, in this case the AI is a different species, a parasite, and we are the hosts.