Why I’m excited about Debate

I think Debate is probably the most exciting existing safety research direction. This is a pretty significant shift from my opinions when I first read about it, so it seems worth outlining what’s changed. I’ll group my points into three categories. Points 1-3 are strategic points about deployment of useful AGIs. Points 4-6 are technical points about Debate. Points 7-9 are meta-level points about how to evaluate safety techniques, in particular responding to Beth Barnes’ recent post on obfuscated arguments in Debate.

1. Question-answering is very useful.

People often claim that question-answering AGIs (which I’ll abbreviate as QAGIs) will be economically uncompetitive compared with agentic AGIs. But I don’t think this matters very much for the two most crucial applications of AGIs. Firstly, when it comes to major scientific and technological advances, almost all of the value is in the high-level concepts—it seems unlikely that implementing those advances will require AGI supervision (rather than just supervision from narrow AIs) during deployment.

Secondly, aligned QAGIs can do safety research to help us understand how to build aligned agentic AGIs, and can also predict and prevent their misbehaviour. So even a relatively small lead for aligned QAGIs could be very helpful.

2. Debate pushes capabilities in the right direction.

Another objection I used to have: I tend to expect that QAGIs are pretty safe anyway, which implies that in aligning QA systems, Debate isn’t helping tackle the cases we should be most worried about. And if it’s used as a final step of training for systems that have previously been trained to do other things, then it’s very unclear to me whether Debate would override whatever unsafe motivations those systems had already acquired.

But now I think that Debate could be an important tool for not only making QA systems more aligned, but also more competitive. Systems like GPT-3 have shown a very good understanding of language, and I expect them to gain much more world-knowledge, but it’s hard to elicit specific answers from them. We might hope to make them do so by using reward-modelling to fine-tune them, but I expect that in order to scale this up to complex questions, we’ll need to make that process much more efficient in human time. That’s what Debate does, by allowing humans to evaluate answers on criteria that are much simpler than the holistic question “is this answer good?”

3. Debate provides a default model of interaction with AGIs.

We won’t just interact with agentic AGIs by giving them commands and waiting until they’ve been carried out. Rather, for any important tasks, we’ll also ask those AGIs to describe details of their plans and intentions, and question them on any details which we distrust. And for additional scrutiny, it seems sensible to run these answers past other AGIs. In other words, Debate is a very natural way to think about AGI deployment, and describes a skillset which we should want all our AGIs to have, even if it’s not the main safety technique we end up relying on.

4. Debate implicitly accesses a complex structure.

I originally thought that Debate was impractical because debates amongst humans aren’t very truth-conducive. But now I consider it misleading to think about Debate as simply a more sophisticated version of what two humans do. The comparison to a game of Go is illuminating. Specifically, let’s interpret any given Go position as a question: who wins the game of Go starting from this position? Then we can interpret a single game of Go played from that position, by sufficiently strong players, as good evidence that the (exponential) tree of other possible games doesn’t contain a refutation of any of the moves played by the eventual winner. Similarly, the hope is that we can interpret a single line of debate, starting from a given question, as good evidence that the exponential tree of other lines of debate doesn’t contain a refutation of any of the claims made by the eventual debate winner.

In other words, the core insight of Debate is that we can evaluate a whole argumentative tree while only exploring one branch, given a strict standard of judging (i.e. whoever loses that one branch loses the whole tree), because the debaters will model the rest of the tree to the best of their (superhuman) abilities. We can’t do this in normal debates between humans, because human debaters aren’t smart enough for other humans to reliably interpret the outcome of one specific branch of a debate tree as strong evidence about the rest of the tree. Therefore human incentives are very rarely set up to punish minor errors, which can allow inaccuracies to compound. A better analogy than a normal human debate might be a human debate in which, before making each argument, each side can consult a large team of experts on the topic; in this case, it seems much more reasonable to expect both that small mistakes will be caught, and that any small mistakes which are made are deliberate lies, which should cause the liar to lose the debate. (Punishing minor errors does introduce more variance into AI debates, since the truth-telling debater could get unlucky by making small mistakes. But unlike human debates, we can run AI debates a large number of times, hopefully decreasing that variance significantly.)
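
To make this intuition concrete, here’s a toy sketch (a simplified illustration of the one-branch-evaluates-the-tree idea, not the actual Debate training setup): it treats a debate as a game tree whose leaves are the only nodes the judge evaluates directly, and checks that a single optimally-played branch yields the same verdict as exhaustively evaluating the whole tree. The branching factor, depth, and leaf-judging rule are arbitrary stand-ins.

```python
# Toy sketch: a debate "tree" whose leaves are the only nodes the judge
# evaluates directly. We compare (a) exhaustively evaluating the full
# exponential tree via minimax with (b) following a single line of debate
# in which both debaters pick moves by modelling the rest of the tree.

BRANCHING, DEPTH = 3, 4  # illustrative tree size

def judge(leaf):
    """Stand-in for the human judge: +1 means this fully-expanded line of
    argument favours debater A, -1 means it favours debater B. The rule
    here is arbitrary; it just needs to be fixed in advance."""
    return 1 if sum(leaf) % 2 == 0 else -1

def full_tree_value(node, depth, a_to_move):
    """Exhaustive minimax evaluation of the whole (exponential) debate tree."""
    if depth == 0:
        return judge(node)
    children = [node + (move,) for move in range(BRANCHING)]
    values = [full_tree_value(c, depth - 1, not a_to_move) for c in children]
    return max(values) if a_to_move else min(values)

def single_line_of_debate(depth):
    """The judge only ever sees one branch: at each step the current debater
    picks the move that looks best according to their model of the rest of
    the tree (here, exact minimax; in reality, an approximation)."""
    node, a_to_move = (), True
    for remaining in range(depth, 0, -1):
        children = [node + (move,) for move in range(BRANCHING)]
        pick = max if a_to_move else min
        node = pick(children,
                    key=lambda c: full_tree_value(c, remaining - 1, not a_to_move))
        a_to_move = not a_to_move
    return judge(node)

# If both debaters play optimally, the verdict on the single branch the judge
# sees matches the verdict from evaluating the entire tree.
assert single_line_of_debate(DEPTH) == full_tree_value((), DEPTH, True)
```

Of course, real debaters can only approximate this kind of lookahead; the hope is that sufficiently capable debaters approximate it well enough that the judge’s verdict on the one branch they see tracks the verdict over the whole implicit tree.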

The intuition I’ve described makes Debate compare favourably to recursive reward modelling (RRM), which needs to actually implement the whole exponential tree of agents answering subquestions. (I think Jan Leike envisages RRM trees as being much shallower than Debate trees, though.) RRM does have other advantages—in particular its ability to train agents which actually take actions in the world. But as already discussed, I find this less compelling than I used to.

5. Reasoning can be truth-conducive even in adversarial environments.

I’m reasonably compelled by Sperber and Mercier’s claim that explicit reasoning in humans primarily evolved not in order to help us find out about the world, but rather in order to win arguments. [EDIT: More specifically, Sperber and Mercier claim that “reason is not geared to solitary use, to arriving at better beliefs and decisions on our own. What reason does, rather, is help us justify our beliefs and actions to others, convince them through argumentation, and evaluate the justifications and arguments that others address to us.” This still involves evaluations which aim to find out which arguments are right about the world.] I think this frames our current situation in a new light: reasoning fails to track the truth so often not necessarily because it’s a weak tool, but because it’s specifically been selected to promote many of these failures (such as overconfidence in our own claims). And yet, despite that, it’s still reasonably truth-conducive—we can still reason about complex scientific domains, for example. This makes me more optimistic that Debate can be fairly truth-conducive even if the training incentives aren’t quite right.

Consider also that, for existential safety, we only need reasoning to be truth-conducive enough to detect catastrophes. We should expect that doing so is much easier than finding the truth about all (potentially very subtle and nuanced) questions.

6. Debate passes the human relevance test.

An important heuristic I use for evaluating safety research directions: would the resulting techniques work for making humans safer, if applied over evolutionary timeframes? This decomposes into a few different components:

  1. Does the technique apply to “prosaic” machine learning, without requiring deeper insights into cognition? I.e. does it work if we build AGI by optimising a big neural network via local search techniques like gradient descent or evolution?

  2. Does the technique scale up to generally intelligent agents?

  3. Does the technique scale up to very complex environments?

  4. Does the technique work even after a long period of “pre-training” instills unsafe traits in the agent?

I think that Debate passes the first three of these tests, which is a good sign! Most other safety techniques don’t. The fourth criterion is a harder one: it’s analogous to the question of whether a safety technique applied to humans would be able to remove deeply-ingrained goals like our survival instincts, while avoiding deceptive alignment. I’m not convinced that Debate solves this, but I don’t think any other safety technique does so either.

7. Debate promotes empirical investigation.

We can consider Debate to be one type of iterated amplification (IA), but for the purposes of doing good follow-up research, I think the specificity of Debate makes it more valuable than the broad concept of IA in general. In particular, this specificity makes it easier for more people to do conceptual and empirical investigation. This is especially important given that so many ideas in safety are so abstract and hard to ground in concrete research questions. Further, the adversarial framing of Debate makes it easier to spot problems, by pushing us to think about how the setup might be exploited. These factors have already contributed to interesting insights, such as OpenAI’s work on cross-examination.

8. Recent negative results aren’t too discouraging.

Beth recently put a post online with negative results from Debate—in particular, the finding that misleading debate trees can be constructed which seem to require examining every step of the argument in order to be refuted.

There are a few reasons I’m not too discouraged by this. Firstly, the physics problems they’re using to evaluate Debate are hard—they are designed to be counterintuitive, and sometimes confuse physics PhDs. Even people who can solve them in one way often can’t evaluate other potential ways of solving them. So the fact that we can generate confusing debate transcripts about them isn’t much evidence about the limitations of Debate on domains which both debaters have thorough knowledge of. Note also that we haven’t yet seen what top-notch physicists with relevant expertise would make of these transcripts; it may be that it only takes a few weeks of practice on these types of everyday physics questions for experts to gain sufficient knowledge to refute misleading arguments.

We might be worried that, if we have such trouble with everyday physics, it’ll be very difficult to scale up to more difficult questions. But note that current difficulties are partly because Debate experiments don’t (yet) allow debaters to make empirical predictions. Given how valuable this step has been for humans, it seems plausible that adding it would make Debate significantly more powerful. A very rough summary of human intellectual history: we tried to make progress via debate for thousands of years, and gained little knowledge (except in maths). Then we started also relying on empirical predictions, and underwent the scientific and industrial revolutions shortly afterwards. Predictions are very powerful tools for cutting through verbal obfuscation, when used in conjunction with verbal reasoning.

9. Recent negative results expect too much from Debate.

The OpenAI team is trying to use Debate to access all the knowledge our agents have. But this seems like an unrealistic goal—consider how incredibly difficult it is in the case of humans. There’s plenty of human knowledge that is very difficult to access (e.g. because it relies on finely-honed intuitions) even when the human in question is being fully cooperative. Indeed, given how much knowledge is tacit or vague, I’m not even sure what it would mean to succeed in accessing all an agent’s knowledge.

From my perspective, if Debate makes it easier to reliably access a part of the debaters’ knowledge, that seems pretty useful. And if it boosts the capabilities of our language models, that would also be great.

More generally, we already knew that there are some problems on which Debate fails—such as problems involving cryptographic functions, where the correct answer is very hard to find. So what we’re trying to figure out is to what extent the most interesting problems fall into that category. But insofar as we’re worried about agents taking harmful actions, the debates we care about will concern claims with verifiable consequences, and so I expect that leveraging empirical evidence (as discussed previously) will be a big advantage.