Define “Interim Friendliness” as a set of constraints on the AI’s behavior which is only meant to last until it figures out true Friendliness, and a “Proxy Judge” as a computational process used to judge the adequacy of a proposed definition of true Friendliness.
Then there’s a large class of Friendliness-finding strategies where the AI is instructed as follows: With your actions constrained by Interim Friendliness, find a definition of true Friendliness which meets the approval of the Proxy Judge with very high probability, and adopt that as your long-term goal / value system / decision theory.
In effect, Interim Friendliness is there as a substitute for a functioning ethical system, and the Proxy Judge is there as a substitute for an exact specification of criteria for true Friendliness.
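To make the shape of this scheme concrete, here is a toy sketch in Python. Every name in it (the candidate proposer, the constraint check, the judge's approval probability) is a placeholder for a component that nobody yet knows how to specify; the only thing the sketch shows is the control flow: search under interim constraints, and stop when the Proxy Judge's approval is near-certain.

```python
# Toy sketch only: every component below is a placeholder for something the
# scheme leaves unspecified (how candidates are generated, how the interim
# constraints are checked, how the Proxy Judge actually judges).
import random

APPROVAL_THRESHOLD = 0.999  # stands in for "very high probability"

class InterimFriendliness:
    """Behavioral constraints meant to hold only during the search."""
    def permits(self, action: str) -> bool:
        return action != "anything_irreversible"  # placeholder rule

class ProxyJudge:
    """Computational stand-in for human judgment of a candidate definition."""
    def approval_probability(self, candidate: str) -> float:
        return random.random()  # placeholder for simulated human judgment

def propose_candidate() -> str:
    return "a candidate definition of true Friendliness"  # placeholder generator

def find_true_friendliness(judge: ProxyJudge,
                           constraints: InterimFriendliness,
                           max_tries: int = 100_000):
    for _ in range(max_tries):
        # Every step of the search must itself satisfy the interim constraints.
        if not constraints.permits("evaluate_candidate"):
            return None
        candidate = propose_candidate()
        if judge.approval_probability(candidate) >= APPROVAL_THRESHOLD:
            return candidate  # adopted as the long-term goal / value system
    return None  # nothing met the Proxy Judge's bar
```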
The simplest conception of a Proxy Judge is a simulated human. It might be a simulation of the SIAI board, or a simulation of someone with a very high IQ, or a simulation of a random healthy adult. A more abstracted version of a Proxy Judge might work without simulation (simulation being the brute-force way of reproducing human judgment), but I think it would still be grounded in counterfactual human judgments. (A non-brute-force way of reproducing human judgment would be to understand the neural and cognitive processes which produce the judgments in question, so that they could be simulated at a very high level of abstraction, or even so that the final judgment could be inferred in some non-dynamical way.)
The combination of Interim Friendliness and a Proxy Judge is a workaround for incompletely specified strategies for the implementation of Friendliness. For example: Suppose your blueprint for Friendliness is to apply reflective decision theory to the human decision architecture. But it turns out that you don’t yet have a precise definition of reflective decision theory, nor can you say exactly what sort of architecture underlies human decision making. (It’s probably not maximization of expected utility.) Never fear, here are amended instructions for your developing AI:
With your actions constrained by Interim Friendliness, come up with exact definitions of “reflective decision theory” and “human decision architecture” which meet the approval of the Proxy Judge. Then, apply reflective decision theory (as thus defined) to the human decision architecture (as thus defined), and adopt the resulting decision architecture as your own.
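Again, purely to make the structure visible, a toy sketch under the same caveats as the one above: the definition proposer, the judge, and the final composition step are all placeholders for things we cannot yet write down.

```python
# Same caveat as before: everything here is a placeholder; the sketch only
# shows the two-stage structure of the amended instructions.
import random

THRESHOLD = 0.999

class ProxyJudge:
    def approval_probability(self, definition: str) -> float:
        return random.random()  # placeholder for counterfactual human judgment

def propose_definition(term: str) -> str:
    return f"an exact definition of {term}"  # placeholder search step

def define_until_approved(term: str, judge: ProxyJudge) -> str:
    """Keep proposing definitions of `term` until the Proxy Judge approves."""
    while True:
        candidate = propose_definition(term)
        if judge.approval_probability(candidate) >= THRESHOLD:
            return candidate

def apply_theory_to_architecture(theory: str, architecture: str) -> str:
    return f"({theory}) applied to ({architecture})"  # placeholder composition

def amended_instructions(judge: ProxyJudge) -> str:
    rdt = define_until_approved("reflective decision theory", judge)
    hda = define_until_approved("human decision architecture", judge)
    # The result is the decision architecture the AI adopts as its own.
    return apply_theory_to_architecture(rdt, hda)
```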
So the AI just needs to find an argument that will hack the mind of one simulated human?
The AI gets Friendliness wrong because of something that the Proxy Judge forgot to consider.
The free variables (Interim Friendliness, Proxy Judge) can be adjusted after reading any unpleasant scenario to rule out that unpleasant scenario. Please pick some specifics.