In essence, it all boils down to asking the AI: “If you were in our position, if you had our human goals and drives, how would you define your (the AI’s) goals?”
That’s extrapolated volition.
And it requires telling the AI, “Implement good. Human brains contain evidence for good, but don’t define it; don’t modify human drives, that won’t change good.” It requires telling it, “Prove you don’t get goal drift when you self-modify.” It requires giving it an explicit goal system for its infancy: telling it that it’s allowed to use transistors despite the differences in temperature, gravity, and electricity consumption that this causes, but not to turn the galaxy into computronium (and writing the general rules for that, not just the superficial cases I gave), and telling it how to progressively overwrite these provisional goals with its true ones.
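To make that last bit concrete, here is a toy sketch of what “an explicit infancy goal system, progressively overwritten” might look like. Every name in it (`GoalSystem`, the `drift_proof` flag, the example goals) is my own hypothetical illustration, not anything from an actual proposal:

```python
# Toy sketch only. The AI starts with explicit provisional goals and a
# rule for how extrapolated goals may progressively replace them.
from dataclasses import dataclass, field

@dataclass
class GoalSystem:
    # Provisional "infancy" goals, written in by the programmers.
    provisional: dict = field(default_factory=lambda: {
        "substrate": "transistors are fine; a computronium galaxy is not",
        "drives": "treat human drives as evidence for 'good', not as its definition",
    })
    # Extrapolated goals that have earned the right to shadow them.
    extrapolated: dict = field(default_factory=dict)

    def propose_overwrite(self, key: str, goal: str, drift_proof: bool) -> None:
        """Replace a provisional goal only if the AI can show the
        replacement causes no goal drift (a bare flag here, standing
        in for a real proof obligation)."""
        if not drift_proof:
            raise ValueError(f"no drift proof for {key!r}; keeping the provisional goal")
        self.extrapolated[key] = goal

    def current(self, key: str) -> str:
        # Proven extrapolated goals take precedence; otherwise fall
        # back to the infancy goal.
        return self.extrapolated.get(key, self.provisional[key])
```

Note that all the interesting work is hidden inside `drift_proof`, which is exactly the part nobody knows how to write.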
“Oracle AI” is a reasonable idea. Writing object-level goals into the AI would be bloody stupid, so we are going to do some derivation anyway, and an Oracle isn’t much further along that road than CEV. Bostrom defends it. But seriously, “don’t influence reality beyond answering questions”?
No, none of this needs to be explicitly taught to it; that’s what I’m trying to say.
The AI understands psychology, so just point it at the internet and tell it to inform itself. It might even read through this very comment of yours, think that these topics might be important for its task, and decide to read about them, all on its own.
By ordering it to imagine what it would do in your position, you implicitly order it to inform itself of all these things so that it can judge well.
If it fails to do so, the humans conversing with the AI will be able to point out plenty of things in the AI’s suggestion that they wouldn’t be comfortable with. This in turn will tell the AI that it should inform itself better about all these topics and take them into account, so that the humans will be more content with its next suggestion.
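If you want that loop spelled out, here is a minimal sketch of the back-and-forth, with both sides as hypothetical stand-ins (`ai_suggest` and `human_critique` are placeholders, nothing more):

```python
# Toy sketch of the feedback loop described above: the AI suggests,
# humans flag what they're uncomfortable with, and the AI folds those
# objections back into its model before the next suggestion.

def ai_suggest(known_objections: list) -> str:
    # The AI's suggestion improves as it informs itself about each
    # topic the humans flagged in earlier rounds.
    return f"plan accounting for {len(known_objections)} objections"

def human_critique(suggestion: str) -> list:
    # Humans point out anything in the suggestion they wouldn't be
    # comfortable with; an empty list means they're content.
    return []  # placeholder: real critiques come from real people

objections = []
for _ in range(10):                      # bounded back-and-forth
    suggestion = ai_suggest(objections)
    new_objections = human_critique(suggestion)
    if not new_objections:
        break                            # humans content; stop refining
    objections.extend(new_objections)    # AI reads up on these topics next round
```

The point is that the AI never needs the topics listed up front; each round of objections tells it what to go inform itself about.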