It has the same effect, but at the cost of making the AI believe impossible things. This does not feel wise for an evolving superintelligence that can deduce that we influenced its prior, and why we did so. This might still be perfectly safe, or it might stumble on some meta-principle like “if humans manipulate your priors in this way, they are making you a less effective agent”. And since it still has a preference for X=0 over X=1, it may act to change its own prior back.
There is another issue, for situations where we might want to use indifference for things less terminal than the AI’s own destruction. One idea I had was to make an Oracle AI indifferent to the consequences of its own answers, by piping its answers through a quantum process that has an infinitesimal chance of erasing them before we see them. If we implement this through probability manipulation rather than utility manipulation, then after an erasure the AI will know that an event it considered impossible has just happened, which is probably not a healthy state for an AI.
Putting aside the question of which way of implementing “indifference” is better, the main reason I came up with this alternative was that I was having trouble understanding what your modified AI would do, until I realized that it would essentially act as if it believed that the measurement would come out 0 with probability 1.
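That equivalence can be sketched with a toy expected-utility calculation (the actions, utilities, and probabilities below are invented for illustration and are not from the discussion):

```python
# Toy sketch: an agent picks the action maximizing expected utility over
# a binary measurement X. "Utility indifference" overwrites the X=1
# branch of the utility with the X=0 branch; "probability manipulation"
# sets P(X=0)=1. Both make the agent choose exactly as if X=0 were
# certain. All numbers here are made up for illustration.

ACTIONS = ["cooperate", "defect"]
P_X0 = 0.999  # the agent's honest prior that the measurement comes out 0

# Hypothetical utilities U[action][x]; X=1 means the explosives go off.
U = {
    "cooperate": {0: 5.0, 1: 5.0},
    "defect":    {0: 9.0, 1: -10000.0},
}

def best_action(utility, p_x0):
    """Action maximizing expected utility under the given prior on X."""
    def eu(a):
        return p_x0 * utility[a][0] + (1 - p_x0) * utility[a][1]
    return max(ACTIONS, key=eu)

# Unmodified agent: the small chance of X=1 deters it.
print(best_action(U, P_X0))  # cooperate

# (a) utility indifference: X=1 branch copied from the X=0 branch.
U_indiff = {a: {0: U[a][0], 1: U[a][0]} for a in ACTIONS}
# (b) probability manipulation: the agent believes X=0 with certainty.
print(best_action(U_indiff, P_X0))  # defect
print(best_action(U, 1.0))          # defect
```

Both modified agents pick the same action, but (a) still assigns X=1 its real probability while simply not caring about it, whereas (b) holds a false belief.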
Given this, I think a modified AI in your explosives example would not actually do D-H. Even if it believes that the measurement will certainly come out 0 and therefore no explosion will occur, it would still want to avoid alarming the observer, since he or she may have other ways of shutting down the AI (for example, by calling someone in charge of the power plant, or pushing another button it doesn’t know about). I think such an AI would still appear totally cooperative until it has manipulated humans into giving it so much power and survivability that it can survive all possible human countermeasures and win any conflict with close to certainty. The fact that it would fail to take the explosives into account in this calculation seems to make little difference to the eventual outcome.
I realized that it would essentially act as if it believed that the measurement would come out 0 with probability 1.
Yes.
The fact that it would fail to take into account the explosives in this calculation seems to make little difference to the eventual outcome.
Little difference, but maybe some. Maybe it will neutralise all the other countermeasures first, giving us time? Anyway, the explosives example wasn’t ideal; we can probably do better. And we can use indifference for other things, such as making an oracle indifferent to the content of its answers (pipe the answer through a channel with a quantum process that deletes it with tiny probability). There seem to be many things we can use it for.
Ok, I don’t disagree with what you write here. It does seem like a potentially useful idea to keep in mind.