I’m not a machine learning expert, so I’m not sure what exactly causes sycophancy. I don’t see it as a central problem of alignment; to me it is just a symptom of a deeper problem.
My point is more general: To achieve true alignment in the sense of an AI doing what it thinks is “best for us”, it is not sufficient to train it by rewarding behavior. Even if the AI is not sycophantic, it will pursue some goal that we trained into it, so to speak, and that goal will most likely not be what we would have wanted it to be in hindsight.
Contrast that with the way I behave towards my sons: I have no idea what their goals are, so I can’t say that my goals are “aligned” with their goals in any strict sense. Instead, I care about them: their wellbeing, but also their independence from me and their ability to find their own way to a good life. I don’t think this kind of intrinsic motivation can be “trained” into an AI with any kind of reinforcement learning.
The analogy actually has two differences: 1) mankind can read the chains of thought of SOTA LLMs and run interpretability probes on the AIs’ thoughts; 2) a misaligned human, unlike a misaligned AI, has no way to, for example, single-handedly wipe out the entire world except for one country. A misaligned ASI, on the other hand, could do so once it establishes a robot economy capable of existing without humans.
So mankind is trying to prevent the appearance not of an evil kid (which is close to being solved on the individual level, but requires actual effort from the adults), but of an evil god who will replace us.
Yes, I agree, although I don’t believe that CoT reading will carry us very far into the future; it is already pretty unreliable, and using it for optimization would ruin it completely.
Alignment is difficult; that is my whole point, with an emphasis on the hypothesis that with any purely reinforcement-learning-based approach it is virtually impossible. If we could find a way to create some kind of genuine “care about humans” module within an AI, similar to the kind of parent-child altruism that I write about, we might have a chance. But the problem is that no one knows how to do that, and even in humans it is quite a fragile mechanism.
One additional thought: Evolution has created the parent-child care mechanism through some kind of reinforcement learning, but it optimizes for a different objective than our current AI training process: not any kind of direct evaluation of human behavior, but survival and reproduction. Maybe the evolution of spiral personas is closer to the way evolution works. But of course, in this case AI is a different species, a parasite, and we are the hosts.