Like, it seems obvious you are going to fail to train the model to perfectly internalize human morality
I object to this sentence.
“Human morality” isn’t a thing. There’s a godshatter of social and moral instincts, which differ substantially from human to human; there are various local norms and mores; and there are many explicit theories of morality and ethics.
By human morality, one might mean something like CEV, or the reflective equilibrium of “human values” (whatever those are). But neither any specific human nor humanity as a whole “perfectly internalizes human morality” in that sense.
The standard for all of individual humans, human cultures, and AIs is not “did they perfectly encapsulate human morality?”
It’s more like, “is the jumbled mix of preferences and moral stances that this agent is implementing good enough to result in a basically good outcome, for different levels of empowerment of that agent?”
The more power an AI (or an individual human, or an individual culture) gains, and the less it is constrained by external forces, the higher the stakes. AI alignment is harder than raising a child to be a fair and productive member of society, because (among other reasons) eventually the AIs will totally outstrip us in power.
But the standard isn’t “perfect human morality”. It’s “good enough morality.” Where good enough includes “not killing all the humans, and not brainwashing all the humans in egregious ways.”
Sure, but at the point where you no longer have humans around providing any substantial control signal, you must have internalized it in a way that generalizes very, very far.
Or, staying more closely within your model: at some point, unless we do something clever that we don’t currently seem on track to do, AI systems will self-improve without humans and reach extreme levels of empowerment; indeed, doing so is approximately the current mainline plan of leading AI companies. At extreme levels of empowerment, you need extreme levels of having internalized human morality.
And for that, I don’t see why the standard wouldn’t be “perfect human morality”. It seems to me that “basically perfect human morality” is well within our reach this or next century, if we were to be appropriately careful about how we build ASI. Like, much better value alignment than we would have gotten by just leaving it up to the evolutionary process of future generations. And given that that is within reach, I think that’s a reasonable thing to measure our progress against.
Where good enough includes “not killing all the humans, and not brainwashing all the humans in egregious ways.”
This is obviously not sufficient. An alien god emperor who is not killing all the humans, but is enslaving them or keeping some of them in a zoo, would of course be a total failure of value alignment.