The most natural way to approach this setting is to say: the AI will produce an output, and a human will produce an output (perhaps with a small probability), and the goal of the AI is to make those outputs indistinguishable to some evaluator.
Instead you are proposing using “reduced impact AI,” i.e. saying: the AI produces an output, and a human produces an output, and then one of them at random is printed and the other is shredded. The goal of the AI is to make the whole resulting states of the universe indistinguishable to an observer.
As far as I can tell, the only reason for all of this machinery is to prevent the AI from incidentally having other side effects on the world. Is that correct?
Even if you want to use this machinery to prevent side effects (I’m not optimistic), it seems like we can do it much more directly.
For example:
A. With probability 1% we don’t run the AI at all.
B. With probability 1% we run the AI and shred its output.
C. With probability 1% we run the AI, and have a human generate a similar output. We give the AI’s output to an evaluator. Afterwards, we shred both outputs.
D. With probability 1% we do the same thing as in case C, except we give the evaluator the human’s output.
E. With probability 96% we use the AI’s output as intended.
Then we give the AI a penalty based on the ability of a hypothetical observer to distinguish cases A and B (conditioned on one of them occurring), as well as a penalty based on the ability of the evaluator to distinguish cases C and D (conditioned on one of them occurring). If these penalty terms are large, this forces the AI to mimic the human output while having no side effects other than via its actions.
This has the nice advantage that you don’t have to put a human in a box, or kill anyone, or restrict your attention to what are essentially multiple-choice questions.
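To make the scheme concrete, here is a rough Python sketch of the case split and the two penalty terms. It is only an illustration under my own assumptions: the names `sample_case`, `advantage`, and `total_penalty` are hypothetical, "ability to distinguish" is modeled as a guesser's accuracy above chance on a pair of equally likely cases, and the two terms are combined by a simple weighted sum. None of that is pinned down by the proposal itself.

```python
import random

# Case probabilities from the list above: A-D at 1% each, E at 96%.
CASES = {"A": 0.01, "B": 0.01, "C": 0.01, "D": 0.01, "E": 0.96}


def sample_case(rng):
    """Draw one of the five cases according to CASES."""
    labels = list(CASES)
    weights = [CASES[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def advantage(guesses, truths):
    """Distinguishing advantage of a guesser over a pair of cases.

    Assumption: conditioned on one case of the pair occurring, both are
    equally likely (1% each), so chance accuracy is 0.5. The result is 0.0
    for a guesser at chance and 1.0 for a perfect guesser. A guesser who is
    always wrong also scores 1.0, since its guesses can simply be inverted.
    """
    if not guesses or len(guesses) != len(truths):
        raise ValueError("need equal-length, non-empty guesses and truths")
    correct = sum(g == t for g, t in zip(guesses, truths))
    return abs(2.0 * correct / len(guesses) - 1.0)


def total_penalty(adv_ab, adv_cd, weight=100.0):
    """Combine the A-vs-B (side effects) and C-vs-D (mimicry) terms.

    The weighted sum and the default weight are illustrative choices only.
    """
    return weight * (adv_ab + adv_cd)


if __name__ == "__main__":
    rng = random.Random(0)
    print("sampled case:", sample_case(rng))
    # An observer who cannot tell A from B, and an evaluator who guesses
    # C vs D correctly three times out of four.
    adv_ab = advantage(["A", "A", "A", "A"], ["A", "B", "A", "B"])
    adv_cd = advantage(["C", "D", "C", "C"], ["C", "D", "C", "D"])
    print("penalty:", total_penalty(adv_ab, adv_cd))
```

The A-versus-B term only looks at worlds where the output is shredded, so it targets side effects of running the AI; the C-versus-D term only looks at what the evaluator receives, so it targets how closely the AI imitates the human.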