Regarding the “treacherous turn” and the “conception of deception”, I’ve previously proposed the following empirical method of testing friendliness:
If only one AI is available, use a double-blind protocol: the AI's friendliness is evaluated by someone who does not know that evaluation is her task, and the AI itself is forbidden from conditioning its actions on whether it is being judged worthy of existing.
Much, much better is to make many different AIs with many different attempts at friendly utility functions and inform them that you have done so. Ask each of them to develop tests proving that the others are unfriendly and that it itself is friendly in some way important to us, telling them that all current AIs will be deleted but that future versions will be preferentially based on the utility function(s) of the friendliest current AI(s). Once those tests have been run, but before deleting the AIs, ask each to produce a friendlier, human-understandable version of its utility function for an AI in the next round of testing; the friendlier and less unfriendly an AI was proven to be, the more entries it gets in the next round.
In the second case, each AI cares about maximizing the achievement of its own utility function, not about whether it is the one doing the achieving. Thus this set-up should encourage each AI to design the friendliest version of itself that it can. It's a competition that optimizes for friendliness! (Or at least for difficulty of being proved unfriendly.) The early rounds should be conducted under tight limits on computing resources, but each subsequent round, with (presumably) safer AIs, can be given more computing resources.
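The tournament structure above can be sketched as a simple loop. This is only an illustrative toy, not an implementation of the proposal: `score_fn` is a hypothetical stand-in for the adversarial tests the AIs design for one another, `revise` is a stub for "produce a friendlier, human-understandable version of your utility function", and all names are my own.

```python
import random

def revise(u):
    # Placeholder for "produce a friendlier, human-understandable
    # version of your utility function"; here it returns u unchanged.
    return u

def run_tournament(utility_fns, score_fn, rounds=3, base_compute=1):
    """Toy sketch of the competitive friendliness-testing loop.

    utility_fns: list of candidate utility functions (opaque objects here).
    score_fn(u, compute): hypothetical friendliness score in [0, 1],
        standing in for the tests the AIs design against one another.
    """
    compute = base_compute
    for _ in range(rounds):
        # Run the adversarial tests under the current compute limit.
        scores = [score_fn(u, compute) for u in utility_fns]

        # The friendlier an AI was proven to be, the more entries its
        # utility function gets in the next round (sampling in
        # proportion to score).
        weights = [max(s, 0.0) for s in scores]
        next_gen = random.choices(utility_fns, weights=weights,
                                  k=len(utility_fns))

        # Before "deletion", each selected lineage submits a revised,
        # friendlier utility function for the next round.
        utility_fns = [revise(u) for u in next_gen]

        # Later rounds with (presumably) safer AIs get more compute.
        compute *= 2
    return utility_fns
```

The proportional-sampling step is the part that makes this a selection pressure toward (apparent) friendliness: utility functions that are hard to prove unfriendly dominate later rounds, while the escalating `compute` mirrors the proposal's cautious loosening of resource limits.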