Planned summary for the Alignment Newsletter:

We’ve <@previously seen@>(@Writeup: Progress on AI Safety via Debate@) work on addressing potential problems with debate, including (but not limited to):
1. Evasiveness: By introducing structure to the debate and explicitly stating which claim is under consideration at each point, we can prevent dishonest debaters from simply avoiding precision. (A toy sketch of points 1 and 2 follows this list.)
2. Misleading implications: To prevent the dishonest debater from “framing the debate” with misleading claims, debaters may also choose to argue about the meta-question “given the questions and answers provided in this round, which answer is better?”.
3. Truth is ambiguous: Rather than judging whether answers are _true_, which can be ambiguous and depend on definitions, we instead judge which answer is _better_.
4. Ambiguity: The dishonest debater can use an ambiguous concept, and then later choose which definition to work with depending on what the honest debater says. This can be solved with <@cross-examination@>(@Writeup: Progress on AI Safety via Debate@).
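To make points 1 and 2 concrete, here is a minimal sketch (my own illustration with hypothetical names, not the exact protocol from the writeup) of a debate round that carries an explicit claim under consideration and can be reframed as the meta-question:

```python
from dataclasses import dataclass, field

@dataclass
class DebateRound:
    """One round of a structured debate (hypothetical sketch)."""
    question: str
    answers: tuple                 # the two debaters' answers
    claim: str                     # point 1: the claim under consideration, stated explicitly
    subclaims: list = field(default_factory=list)

    def recurse_on(self, subclaim: str) -> "DebateRound":
        """Narrow the debate to one explicitly stated subclaim, so a
        dishonest debater cannot stay vague about what is disputed."""
        assert subclaim in self.subclaims, "must pick a stated subclaim"
        return DebateRound(self.question, self.answers, claim=subclaim)

    def to_meta_question(self) -> "DebateRound":
        """Point 2: reframe the round as 'given the questions and answers
        provided in this round, which answer is better?'"""
        meta = (f"Given the question {self.question!r} and the answers "
                f"{self.answers!r}, which answer is better?")
        return DebateRound(question=meta, answers=self.answers, claim=meta)
```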
This post presents an open problem: the problem of _obfuscated arguments_. This arises when the dishonest debater presents a long, complex argument for an incorrect answer, where neither debater knows which of the series of steps is wrong. In this case, any given step is quite likely to be correct, and the honest debater can only say “I don’t know where the flaw is, but one of these steps is incorrect”. Unfortunately, honest arguments are also often long and complex, so a dishonest debater can make exactly the same claim about them, and it’s not clear how to distinguish the two cases.
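To see why this is hard to punish, here is a minimal simulation (my own sketch, not code from the post), under the assumption that when neither debater knows where the flaw is, the debate does no better than verifying a single uniformly random step:

```python
import random

def judge_checks_one_step(num_steps: int, flawed_step: int | None) -> bool:
    """The judge verifies one uniformly sampled step; a correct step
    always checks out, the flawed step never does."""
    return random.randrange(num_steps) != flawed_step

def acceptance_rate(num_steps: int, flawed_step: int | None,
                    trials: int = 100_000) -> float:
    return sum(judge_checks_one_step(num_steps, flawed_step)
               for _ in range(trials)) / trials

N = 100  # steps in the argument
print(f"honest (no flawed step):      {acceptance_rate(N, None):.3f}")  # 1.000
print(f"obfuscated (one flawed step): {acceptance_rate(N, 42):.3f}")    # ~0.990
```

With 100 steps the two transcripts are accepted at nearly the same rate, matching the post’s point that any given step of an obfuscated argument is quite likely to be correct.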
While this problem was known to be a potential theoretical issue with debate, the post provides several examples of this dynamic arising in practice in debates about physics problems, suggesting that this will be a problem we have to contend with.
Added an opinion:

This does seem like a challenging problem to address, and as the authors mention, it also affects iterated amplification. (Intuitively, if the decomposition chosen during iterated amplification happens to be one that ends up obfuscated, then iterated amplification will arrive at the wrong answer.) I’m not really sure whether I expect this to be a problem in practice: it feels like it could be, but it also feels like we should be able to address it using whatever techniques we use for robustness. But I generally feel very confused about this interaction and want to see more work on it.
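As a rough picture of the amplification worry (my own construction, under the assumption that one bad subanswer corrupts its parent’s answer): a single subtly wrong leaf in the decomposition propagates all the way to the top-level answer, while spot checks on random leaves rarely find it.

```python
import random

def combine_up(subanswers: list[bool], branching: int) -> bool:
    """Combine subanswers level by level; AND stands in for any
    aggregation where a single bad subanswer corrupts its parent."""
    if len(subanswers) == 1:
        return subanswers[0]
    groups = [subanswers[i:i + branching]
              for i in range(0, len(subanswers), branching)]
    return combine_up([all(g) for g in groups], branching)

num_leaves = 4 ** 4                               # a depth-4, branching-4 decomposition
flawed = random.randrange(num_leaves)             # one subtly wrong leaf, location unknown
leaves = [i != flawed for i in range(num_leaves)]
print(combine_up(leaves, branching=4))            # False: the error reaches the top
# Spot-checking k random leaves finds the flaw with probability only k / num_leaves.
```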
I’d be interested to hear more of your thoughts on how we might use robustness techniques!