I think I understood the first three paragraphs. The AI “ramming a button to the human” clearly is a problem, and an important one at that. However, I would say it is one that you already need to address in any single-agent scenario—by preventing the AI from doing this (boxing), ensuring it doesn’t want to do it (???), or by using an AI that is incapable of doing it (a weak ML system). As a result, I view this issue (even in this two-agent case) as orthogonal to debate. In the post, this is one of the things that hides under the phrase “assume, for the sake of argument, that you have solved all the ‘obvious’ problems”.
Or did you have something else in mind by the first three paragraphs?
I didn’t understand the last paragraph. Or rather, I didn’t understand how it relates to debate, what setting the AIs appear in, and why they would want to behave as you describe.
The point of the last paragraph was that if you have two AIs that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don’t want.
If the AI is in a perfect box, then no human hears its debate. If it’s a sufficiently weak ML system, it won’t do much of anything. For the ??? AI that doesn’t want to get out, that would depend on how that worked. There might or might not be some system consisting of fairly weak ML and a fairly weak box that is safe and still useful. It might be possible to use debate safely, but it would be with agents carefully designed to be safe in a debate, not arbitrary optimisers.
if you have two AIs that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don’t want
That is a good point, and this can indeed happen. If I believe something is a piece of chocolate while you—hating me—believe it is poison, we will happily coordinate towards me eating it.
I was assuming that the AIs are copies of each other, which would eliminate most of these cases. (The remaining cases would be when the two AIs somehow diverge during the debate. I totally don’t see how this would happen, but that isn’t a particularly strong argument.)
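To make the chocolate/poison case concrete, here is a toy expected-utility calculation (my own construction, not from the thread; all numbers are made up for illustration). Two agents with exactly negated utilities can still both assign positive expected utility to the same action when their probability estimates differ:

```python
def expected_utility(p_chocolate, u_chocolate, u_poison):
    """Expected utility of the action 'I eat the object',
    given a belief p_chocolate that it is chocolate (else poison)."""
    return p_chocolate * u_chocolate + (1 - p_chocolate) * u_poison

# Agent A (me): nearly certain the object is chocolate.
# My utilities: +1 for eating chocolate, -10 for eating poison.
eu_a = expected_utility(p_chocolate=0.99, u_chocolate=1, u_poison=-10)

# Agent B (hates me, so its utilities are exactly the negation of mine):
# nearly certain the object is poison, so "I eat it" looks great to it.
eu_b = expected_utility(p_chocolate=0.05, u_chocolate=-1, u_poison=10)

print(eu_a)  # 0.99*1 + 0.01*(-10) = 0.89 > 0: A favours eating it
print(eu_b)  # 0.05*(-1) + 0.95*10 = 9.45 > 0: B also favours A eating it
```

Both expected utilities are positive, so despite having opposite preferences over outcomes, the two agents happily coordinate on the same action. With identical probability estimates (e.g. if the debaters are copies of each other), the two expected utilities would be exact negations and this can’t happen.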
Also, the debaters had better be comparably smart.
Yes, this seems like a necessary assumption in a symmetric debate.
Once again, this is trivially satisfied if the debaters are copies of each other.
It is interesting to note that this assumption might not be sufficient, because even if the debate has symmetric rules, the structure of claims might not be. (That is, some false claims may be easier to argue for than against, and some attempted human-hacks may be easier to pull off than to prevent.)