We don’t want our core values changed; we would really rather avoid the murder pill and we’d put up resistance if someone tried to force one down our throat. Which is a sensible strategy, for steering away from a world full of murders.
OTOH …people do things that are known to modify values , such as travelling, getting an education and starting a family.
The trouble is that almost all goals (for most reasonable measures you could put on a space of goals) prescribe “don’t let your goal be changed” because letting your goal get changed is usually a bad strategy for achieving your goal
A von Neumann rationalist isn’t necessarily incorrigible, it depends on the fine details of the goal specification. A goal of “ensure as many paperclip as possible in the universe” encourages self cloning, and discourages voluntary shut down. A goal of “make paperclips while you are switched in” does not. “Make paperclip while that’s your goal”, even less so.
A great deal of the danger of AI arises from the fact that sufficiently smart reasoners are likely to converge on behaviors like “gain power” and “don’t let people shut me off.”
There a solution.** If it is at all possible to instill goals, to align AI, the Instrumental Convergence problem can be countered by instilling terminal goals that are the exact opposite** … remember,
instrumental goals are always subservient to terminal ones. So, if we are worried about a powerful AI going on a resource acquisition spree , we can give it a terminal goal to be economical in the use of resources.
A goal of “make paperclips while you are switched in” does not. “Make paperclip while that’s your goal”, even less so.
This proposed solution is addressed directly in the article:
“I still don’t see why this is hard,” says the somewhat more experienced computer scientist who is not quite thinking fast enough. “Let V equal U1 in worlds where the button has never been pressed, and let it equal U2 in worlds where the button has been pressed at least once. Then if the original AI is a V-maximizer building more AIs, it will build them to follow V and not U1; it won’t want the successor AI to go on maximizing U1 after the button gets pressed because then it would expect a lower V-score. And the same would apply to modifying itself.”
But here’s the trick: A V-maximizer’s preferences are a mixture of U1 and U2 depending on whether the button is pressed, and so if a V-maximizer finds that it’s easier to score well under U2 than it is to score well under U1, then it has an incentive to cause the button to be pressed (and thus, to scare the user). And vice versa; if the AI finds that U1 is easier to score well under than U2, then a V-maximizer tries to prevent the user from pressing the button.
OTOH …people do things that are known to modify values , such as travelling, getting an education and starting a family.
A von Neumann rationalist isn’t necessarily incorrigible, it depends on the fine details of the goal specification. A goal of “ensure as many paperclip as possible in the universe” encourages self cloning, and discourages voluntary shut down. A goal of “make paperclips while you are switched in” does not. “Make paperclip while that’s your goal”, even less so.
There a solution.** If it is at all possible to instill goals, to align AI, the Instrumental Convergence problem can be countered by instilling terminal goals that are the exact opposite** … remember, instrumental goals are always subservient to terminal ones. So, if we are worried about a powerful AI going on a resource acquisition spree , we can give it a terminal goal to be economical in the use of resources.
This proposed solution is addressed directly in the article: