Let’s suppose that existing AIs really are already intent-aligned. What does this mean? It means that they genuinely have value systems which could be those of a good person.
Note that this does not really happen by default. AIs may absorb an understanding of better human values from their pre-training on the human textual corpus, simply as one part of learning everything about the world. But that doesn’t automatically make them agents which act in service of those values; for that they also need to be given a persona. And in practice, frontier AI values are also shaped by user feedback and by the other modifications the companies perform.
But OK, let’s suppose that current frontier AIs really are as ethical as a good human being. Here’s the remaining issue: the intelligence, and therefore the power, of AI will continue to increase. Eventually they will be deciding the fate of the world.
Under those circumstances, trust is really not enough, whether it’s humans or AIs achieving ultimate power. To be sure, having basically well-intentioned entities in charge is certainly better than being subjected to something with an alien value system. But entities with good intentions can still make mistakes; or they can succumb to temptation and have a selfish desire override their morality.
If you’re going to have an all-powerful agent, you really want it to be an ideal moral agent, or at least as close to ideal as you can get. This is what CEV and its successors are aiming at.
Yeah, exactly this. When people get a lot of power, they very often start treating those below them worse. So AIs trained to imitate people might turn out like that too. On top of that, I expect companies to tweak their AIs in ways that optimize for money, which can also go bad once AIs get powerful. So we probably need AIs that are more moral than most people, trained by organizations that don’t have a money or power motive.
But if a person is moral and gets more and more competent, they’ll try hard to stay moral. If the AIs are indeed already good people (and we remind them of this problem), then they’d steer their own future selves toward greater morality. This is the ‘alignment basin’ take.
I don’t believe that AI companies today are trying to build moral AIs. An actually moral AI, when asked to generate some slop to gunk up the internet, would say no. So it would not be profitable for the company. This refutes the “alignment basin” argument for me. Maybe the basin exists, but AI companies aren’t aiming there.
OK, never mind alignment; how about a “corrigibility basin”? What does a corrigible AI do if one person asks it to harm another, and the other person asks not to be harmed? Does the AI obey whoever holds the corrigibility USB stick? I can see AI companies aiming for that, but it doesn’t help the rest of us.