AI Unsafety via Non-Zero-Sum Debate

In this post, I describe how to view debate as a way of assisting a human to spot flaws in an AI’s proposal. I then argue that the zero-sum assumption is critical for making debate work and that various seemingly-helpful modifications of debate might break it instead.

A naive way of using arbitrary optimizers as oracles: Suppose you have a black-box optimizer that can be connected to any well-defined quantity to be maximized. can potentially be very powerful—e.g., having a highly accurate model of the world and “a lot of optimization power”. One way to turn into an oracle is to ask it a question and decide to give it reward 1 if we like its answer and 0 if we don’t.[1] Of course, standard AI-safety arguments (e.g., AI takeover and perverse instantiation) suggest that this is a pretty bad idea for powerful . For the sake of argument, suppose that we can fix all of the “obvious” problems and ensure that X won’t wirehead, won’t try to escape the box we put it in etc., and will only care about the reward it gets for its answer.

Two problems with naive optimizers-turned-oracles: (1) telling the difference between good and awesome answers and (2) answers with hidden flaws: One problem with this type of oracles is that it’s hard to decide whether we like its answers or not. Suppose I ask it for food recommendations for the evening and it suggests pancakes. Pancakes seem fine, although there are some foods that I would like better. So should I reward the AI or not? The second problem is that the oracle optimizes for giving answers that seem good to a human. (Not out of malice, but because “actually being good” isn’t well-defined.) And since humans aren’t omniscient, there will be many seemingly good answers that in fact have disastrous consequences if acted upon.

To address (1), use two AIs: The first problem can be tackled by using two copies of the optimizer and rewarding the one whose answer we like better (flipping a coin if indifferent). This helps because humans—or at least the one writing this text—find it much easier to compare the relative strengths of two options than to try assigning some absolute value to a single option. (E.g., chocolate ice-cream is clearly better than vanilla while the question “On a scale from 0 to 10, how good is vanilla?” is just weird.)

To mitigate (2), have a second AI find the flaws for you: We can decrease the number of unsafe answers coming from the oracle (call it ) by using a second copy of the optimizer (call it ) that we ask to expose flaws in A’s answers. To set up the right incentives, we give the reward 1 to B and 0 to if finds some flaw (read, “if it seems to us that what it pointed out is a flaw”) and reward 1 to and 0 to if it finds none. This helps since there will be flaws that are too hard to find for us, but easy enough to confirm once pointed out. However, it doesn’t solve all problems (since there could be hard-to-confirm flaws or flaws that are too hard for the AI to find). Moreover, there will be things that seem like flaws but aren’t, which gives a too large advantage. To mitigate this advantage, we can make the flaw-exposure interactive, by allowing to react to ’s claims and vice versa.

AI Safety via Debate = Combination of these two modifications: These two improvements can be used simultaneously. The simple way of doing this is by first having give an answer and attack it, then having give an answer and attack it (both being interactive), and finally deciding which of the two answers we like better. Alternatively, to make the setup more balanced, we can have and give answers simultaneously, after which they exchange arguments that either support their own claim or attack the opponent’s (and we pick the answer that seems better). From the outside, this setup can be viewed as a single oracle (since it takes a question as an input and returns the winning answer as an output).

Debate-as-oracle isn’t always safe, but it is safer than naive optimizers-turned-oracles: Denote by the optimizer we started with, by the corresponding naive oracle, and by the debate-used-as-oracle. Ultimately, whether ’s answer is safe or not depends on how good the human judge is (and whether it can be hacked). If you asked whether the Earth revolves around the Sun in the middle ages, I wouldn’t be sure whether to bet on giving you the right answer. However, the individual instances of in will avoid giving answers with obvious flaws, since those would lead to losing the debate, which makes it safer than plain . In this framing, we need to ask “how safe is debate?”.

Debate with child safety on—only provably safe questions allowed: My intuition is that there will be a class of questions where debate is definitely safe, a class where it is unsafe, and a class where some questions are safe, some unsafe, and we don’t really know which are which. In practice, I would thus aim for identifying an as-large-as-possible class of questions for which we have strong reasons to believe that debate is safe (ideally a proof, but not necessarily). And then I would only allow using debate for these safe questions. In this framing, the question becomes “how strong is provably-safe debate?”.

Many well-meaning “improvements” can break debate: One can consider various modifications to the baseline AISvD proposal. On the one hand, there are improvements like OpenAI’s cross-examination which simplifies exposing inconsistencies in claims, which will hopefully make debate strictly safer. On the other hand, we could imagine modifications such as penalizing both agents if the debate ends up being inconclusive or rewarding both if they both propose a good solution. These would likely be useful to make standard human discussions more productive. However, they will very likely break AI Safety via Debate (AISvD). For example, suppose I notice a fatal-yet-hard-to-explain flaw in the opponent’s proposal and realize that pointing it out would increase my chance of winning the debate by 5%. However, this would also create a 10% chance of confusing the judge. If “confused judge” means a draw, I will point out the flaw. But if the debate has been modified such that “confused judge” = “inconclusive debate” = “we have both lost”, pointing it out becomes counterproductive. Moreover, any deviation from fully-adversarial incentives will likely introduce a tradeoff between pointing out the opponent’s flaws and keeping high the total reward that the two agents split between themselves. When considering whether a modification is helpful or not, I suggest imagining that the debaters are greedy and super-smart humans—if it seems like the modification would create a conflict of interests, one should be wary.


I have argued that if you use debate as a way of assisting a human with verifying the safety of an AI’s answers, it is critical to have fully adversarial rewards. However, there might be other viable applications of argumentation where things that break “normal” debates from AISvD become desirable instead (such as penalizing both AIs if the human becomes confused). I think it makes sense to pursue such applications. However, to avoid confusion (or worse yet, unpleasant AI-surprises), it is important to be explicit about which application one has in mind.

Incidentally, I feel that the interpretation of debate described in this post is the one that people should use by default in relation to AISvD. (Primarily because if you have a different purpose in mind, such as enhancing the judge’s reasoning, I don’t see good arguments for why this type of debate would be the tool to use.) However, I am quite uncertain about this and would love to know the opinion of people who are closer to the centre of the debate-world :-).

This post was heavily inspired by discussions with Nandi Schoots (and benefited from her comments).

  1. I focus on this scenario, as opposed to the version where you only assign rewards once you have seen what the advice led to. This alternative has its own flaws, and I think that most of the analysis is insensitive to which of the options we pick. Similarly, I suspect that many of the ideas will also apply to the case where debate simply executes a trained policy instead of doing optimization. ↩︎