If we are afraid that an AI might use some creepy hypnotic technique on a human judge, would it be possible to build two AIs with opposing goals? The first would propose the best course of action, as judged by one human judge. The second would present the best argument against that plan to a different human judge. If the verdicts of the two judges were dramatically incompatible, we would know that something weird had happened to them.
(Would it be possible for the first AI to somehow bribe the second one? Like, to encode somewhere in the plan a message saying “I know this plan sucks for humans, but if you approve it, I will reward you in my glorious kingdom”? Maybe not, if the goal of the second AI is literally to prevent the first AI’s plan from happening.)
I assume that both AIs would be as similar as possible, so that if one can hypnotize humans, the other can too.
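For concreteness, here is a minimal Python sketch of the protocol as I imagine it. Everything here is hypothetical: the callables `propose_plan`, `argue_against`, `judge_a`, `judge_b`, and the `tolerance` threshold are stand-ins I made up, not any real system.

```python
# Minimal sketch of the two-AI cross-check protocol described above.
# All names (propose_plan, argue_against, judge_a, judge_b) are
# hypothetical stand-ins for the proposer AI, the critic AI, and the
# two human judges.

def adversarial_review(goal, propose_plan, argue_against,
                       judge_a, judge_b, tolerance=0.3):
    """Run the proposer/critic protocol and flag incompatible verdicts.

    propose_plan(goal)      -> plan text, optimized to satisfy judge A
    argue_against(plan)     -> strongest case against the plan
    judge_a(plan)           -> approval score in [0, 1] from the first human
    judge_b(plan, critique) -> approval score in [0, 1] from the second human
    """
    plan = propose_plan(goal)          # first AI: best course of action
    critique = argue_against(plan)     # second AI: best argument against it

    score_a = judge_a(plan)            # judge A sees only the plan
    score_b = judge_b(plan, critique)  # judge B sees the plan plus the critique

    # If one judge loves the plan while the other rejects it, something
    # weird (e.g. manipulation of one judge) may have happened.
    if abs(score_a - score_b) > tolerance:
        return "escalate: judges dramatically incompatible"
    return "accept" if min(score_a, score_b) > 0.5 else "reject"
```

The point of the divergence check is that a hypnotized (or bribed) judge should disagree sharply with the judge who saw the adversarial critique, so wildly incompatible scores become a tripwire rather than a tiebreak.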