I think that if you’re trying to give your AI ethics at the outset or do something like writing down a CEV utility function, something’s already gone deeply wrong[1], in the same way that capabilities-wise you shouldn’t be hardcoding quantum mechanics into the AI. It’s supposed to be superintelligent—it should learn what you/humans care about. The thing you need to figure out is how to get the AI to care about the thing that humans care about, which it then tries to learn and then optimize for.
Under this model, most of the difficulty would also be present in trying to get an AI that cares about the thing that some agent cares about, e.g. some aliens, or possibly chimpanzees and other animals, to the extent they have goals.
(In fact, I think that somewhere from half to most of the problem is getting an AI that cares about any nontrivial goal about the real world at all, like the classic “maximize the number of diamond atoms”. If you knew how to actually build a literal paperclip maximizer, I would expect that you had figured out much of alignment.)
[1] This includes the case where doing so is a last-resort alternative to other actors doing even dumber stuff. I count corrigibility as an example here: hopefully it’s easier, albeit worse.