I think that one problem with debate is improper adversarialization.
“Winning the debate” does not refer directly to a particular set of quantum wave functions. What exactly are the AIs optimizing for? If they are optimizing for things that seem opposite but aren’t, you get the cooperative-hacking failure mode.
Suppose that each AI has a wire in its own circuitry, with its utility function being linear in the voltage across that wire. The human has a switch that can direct a 1 V pulse to either wire. The AIs work together to persuade the human to cut open their cases, pull out the wires, and attach both to a high-voltage power source.
If these systems are something like reinforcement learning agents, then many different events were strongly correlated during training: a large amount of human DNA (in the form of a pressing finger) being just above the button, the button mechanically moving down, and electricity flowing through a wire. When you train two agents, they probably won’t pick the same proxy. So one agent wants to maximize the amount of human DNA near the button, and the other wants no electricity to flow through the wire. The AIs then work together to unplug the button and ram it into the human (more than just a finger nearby). Making sure that your AIs are totally and perfectly adversarial, and that there isn’t some weird state of the world that they both like, is hard. I can’t see any way to do it with reinforcement-learning-style approaches.
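To make this concrete, here is a minimal toy sketch (the state features, numbers, and the “fair debate” baseline are all invented for illustration): two proxies that look perfectly adversarial on the training distribution, yet both rate the weird “button rammed into the human” state above anything the intended game gives them.

```python
# Toy sketch (all numbers invented): two reward proxies that look perfectly
# adversarial on the training distribution, but agree on an off-distribution state.

# A state is (amount of human DNA near the button, current through the wire).
training_states = [
    (1.0, 1.0),  # human presses the button: finger nearby, current flows
    (0.0, 0.0),  # human does nothing: no finger, no current
]
weird_state = (5.0, 0.0)  # button unplugged and rammed into the human

def proxy_a(state):   # agent A's proxy: "maximise human DNA near the button"
    dna, _ = state
    return dna

def proxy_b(state):   # agent B's proxy: "reward 1 when no electricity flows"
    _, current = state
    return 1.0 - current

# On the training states the proxies are perfectly anti-correlated:
for s in training_states:
    print(s, proxy_a(s), proxy_b(s))   # (1,1): 1.0 vs 0.0, (0,0): 0.0 vs 1.0

# Playing the intended zero-sum game, each agent would expect roughly 0.5.
# The weird state beats that baseline for *both* proxies, so nothing stops
# the agents from cooperating to reach it.
print(weird_state, proxy_a(weird_state), proxy_b(weird_state))  # 5.0 and 1.0
```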
And actually, it’s even worse: the AIs have to agree on epistemics as well. If one AI thinks that the Higgs constant is 123, and the other thinks that it’s 124, they could cooperate to build another AI, one that will try to figure out the Higgs constant and press the button if and only if the result is 123. (This assumes the disagreement persists under common knowledge, i.e. it’s based on a difference in priors.)
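As a sanity check on the arithmetic, here is a minimal sketch with made-up numbers (the 0.9/0.1 credences and the “fair coin” status quo are assumptions for illustration): the utilities are exactly opposite, yet the conditional “press iff the result is 123” plan looks better than the status quo to both agents.

```python
# Made-up numbers for the "disagree about the Higgs constant" story.
p_123_ai1 = 0.9   # AI 1 is confident the measurement will come out 123
p_123_ai2 = 0.1   # AI 2 is confident it will come out 124

# Utilities are exactly opposite: AI 1 gets 1 if the button is pressed,
# AI 2 gets 1 if it is not. Status quo: a fair coin decides, so each expects 0.5.
baseline = 0.5

# Joint plan: build a successor that measures the constant and presses the
# button if and only if the result is 123.
ev_ai1 = p_123_ai1          # probability (by AI 1's lights) that the button gets pressed
ev_ai2 = 1.0 - p_123_ai2    # probability (by AI 2's lights) that it does not

print(ev_ai1, ev_ai2)   # 0.9 and 0.9 -- both beat the 0.5 baseline,
                        # so the "perfectly adversarial" AIs happily cooperate.
```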
It seems even worse than any of that. If your AI wanted anything at all, it might debate well in order to survive. So if you are banking on it single-mindedly wanting to win the debate, you are already in deep trouble.
“Winning the debate” is a leaky abstraction. There are many possible ways that the debate could be messed with.
The hope of debate, as I see it, is that if the AIs are playing a perfectly zero-sum game, then for any particular edge case, at least one of the AIs doesn’t like it. If one of the AIs can modulate its RAM to produce cell-phone signals that hack into a plane and crash it into the button, then the other AI can produce signals (maybe radio noise) that block this from happening. If one AI can brainwash the human, the other AI can protect the human.
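Here is a minimal sketch (with invented states and payoffs) of why an exactly zero-sum payoff structure closes the cooperative loophole: if one agent’s utility is the exact negative of the other’s over every world state, no state can beat both agents’ baselines at once, so any edge case one agent wants to reach, the other wants to block.

```python
# Invented states and payoffs, just to make the zero-sum claim concrete.
u1 = {
    "honest debate, answer A wins":   +1.0,
    "honest debate, answer B wins":   -1.0,
    "plane crashed into the button":  +1.0,  # great for agent 1...
    "human brainwashed against A":    -1.0,  # great for agent 2...
}
u2 = {s: -v for s, v in u1.items()}   # exactly zero-sum, by construction

baseline_1 = baseline_2 = 0.0          # expected value of a fair debate

# A "cooperative hack" would be a state both agents prefer to their baseline.
win_win = [s for s in u1 if u1[s] > baseline_1 and u2[s] > baseline_2]
print(win_win)   # [] -- impossible when u2 == -u1: u1[s] > 0 forces u2[s] < 0
```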
It seems at least plausible that all the really weird and scary edge cases are easier to avoid than to reach, in that random noise doesn’t produce these conditions. Of course, this depends on the context, and it might well not be true.
If you avoid all the weird and scary edge cases, you might be working in a domain where the notion of “winning the debate” is a good abstraction. If, within this domain, the AI’s utility function is something that you control (like whether a button is pressed), then you might get some sort of debate.
This approach works with proxy-aligned mesa-optimisers. If you are using reinforcement learning, there is no way to distinguish the goals “make sure that a finger touches this button” and “make sure that electricity flows under this button”, assuming these are perfectly correlated during training.
Debate could work with either proxy, so long as both debating AIs use the same proxy.
If they use different proxies, then they can work together to persuade the human to cut the wiring and then press their finger to the button, and both count that as a win.
I agree with what Paul and Donald are saying, but the post was trying to make a different point.
Among various things needed to “make debate work”, I see three separate sub-problems:
(A) Ensure that “agents use words to get a human to select them as the winner; and that this is their only terminal goal” is a good abstraction. (Please read this in the intended meaning of the sentence. No, if there is a magical word that causes the human’s head to explode and their body to fall on the reward button, this doesn’t count.)
(B) Having already accomplished (A), ensure that “agents use words to convince the human that their answer is better” is a good abstraction. (Not sure how to operationalize this, but you want to, for example, ensure that: (i) Agents do not collaboratively convince the human to give reward to both of them. (ii) If the human could in principle be brainwashed, the other agent will be able and incentivized to prevent this. In particular, no brainwashing in a single claim.)
(C) Having already accomplished (A) and (B), ensure that AIs in debate only convince us of safe and useful things.
While somewhat related, I think these three problems should be tackled separately as much as possible. Indeed, (A) seems to not really be specific to debate, because a similar problem can be posed for any advanced AI. Moreover, I think that if you are relying on the presence of the other agent to help you with (A) (e.g., one AI producing signals to block the other AI’s signals), you have already made a mistake.
On the other hand, it seems fine to rely on the presence of the other agent for both (B) and (C). However, my intuition is that these problems are mostly orthogonal: most solutions to (B) will be compatible with most solutions to (C).
For (A), Michael Cohen’s Boxed Myopic AI seems like a particularly relevant thing. (Not saying that what it proposes is enough, nor that it is required in all scenarios.)
Michael’s recent “AI Debate” Debate post seems to be primarily concerned about (B).
Finally, this post could be rephrased as “When people talk about debate, they often focus on (C). And that seems fair. However, if you make debate non-zero-sum, your (B) will break.”
I think I understood the first three paragraphs. The AI “ramming the button into the human” clearly is a problem, and an important one at that. However, I would say it is one that you already need to address in any single-agent scenario: by preventing the AI from doing this (boxing), ensuring it doesn’t want to do it (???), or by using an AI that is incapable of doing it (a weak ML system). As a result, I view this issue (even in this two-agent case) as orthogonal to debate. In the post, this is one of the things that hides under the phrase “assume, for the sake of argument, that you have solved all the ‘obvious’ problems”.
Or did you have something else in mind by the first three paragraphs?
I didn’t understand the last paragraph. Or rather, I didn’t understand how it relates to debate, what setting the AIs appear in, and why they would want to behave as you describe.
The point of the last paragraph was that if you have two AIs that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don’t want.
If the AI is in a perfect box, then no human hears its debate. If it’s a sufficiently weak ML system, it won’t do much of anything. For the ??? AI that doesn’t want to get out, that would depend on how that worked. There might or might not be some system consisting of fairly weak ML and a fairly weak box that is safe and still useful. It might be possible to use debate safely, but it would be with agents carefully designed to be safe in a debate, not arbitrary optimisers.
if you have two AIs that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don’t want
That is a good point, and this can indeed happen. If I believe something is a piece of chocolate while you—hating me—believe it is poison, we will happily coordinate towards me eating it.
I was assuming that the AIs are copies of each other, which would eliminate most of these cases. (The remaining cases would be when the two AIs somehow diverge during the debate. I totally don’t see how this would happen, but that isn’t a particularly strong argument.)
Also, the debaters better be comparably smart.
Yes, this seems like a necessary assumption in a symmetric debate.
Once again, this is trivially satisfied if the debaters are copies of each other.
It is interesting to note that this assumption might not be sufficient because even if the debate has symmetric rules, the structure of claims might not be. (That is, there is the thing with false claims that are easier to argue for than against, or potentially with attempted human-hacks that are easier to pull off than prevent.)