The default outcome of debate does not look promising. But there’s a good deal of room to improve on the default.
Maybe half the problem with public discourse is that people have social goals that distract them from reality. I’m not confident that AI researchers will be more truth-oriented, but I see plenty of room for hope.
Drexler’s CAIS paper describes some approaches that are likely needed to make debate work:
Section 25:
Optimized advice need not be optimized to induce its acceptance
Advice optimized to produce results may be manipulative, optimized to induce a client’s acceptance; advice optimized to produce results conditioned on its acceptance will be neutral in this regard.
Section 20:
Collusion among superintelligent oracles can readily be avoided
C1) To improve the quality of answers, it is natural to implement multiple, diverse (and implicitly competing) systems to propose alternatives.
C2) To identify low-quality or misleading answers, it is natural to employ diverse critics, any one of which could disrupt deceptive collusion.
C3) Systems of diverse, competing proposers and critics naturally implement both independent and adversarial objectives.
C4) It is natural to apply fixed (hence memory-free) system instantiations to multiple problems, incidentally yielding a series of history-blind, single-move decisions.
C5) It is natural to provide differentiated, task-relevant information to systems solving different problems, typically omitting knowledge of general circumstances.
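The structure Drexler sketches in C1–C5 can be illustrated with a toy protocol. This is only a sketch of the idea, not anything from the CAIS paper: all the names, and the rule that a single objection disqualifies a candidate, are my illustrative assumptions.

```python
def run_debate(question, proposers, critics, judge):
    """Toy sketch of the C1-C5 structure: diverse proposers generate
    candidate answers, diverse critics flag problems, and a judge picks
    among the surviving candidates. Every call is stateless (C4): no
    proposer or critic retains memory across questions."""
    # C1: multiple, diverse proposers each suggest an answer.
    candidates = [propose(question) for propose in proposers]
    # C2/C3: every critic reviews every candidate; one objection from
    # any critic disqualifies a candidate, so deceptive collusion
    # would require *all* critics to cooperate.
    survivors = [c for c in candidates
                 if not any(critic(question, c) for critic in critics)]
    # C5: the judge sees only the question and the surviving answers,
    # not the proposers' identities or any broader context.
    return judge(question, survivors) if survivors else None

# Toy usage: two proposers, one critic that rejects hedged answers,
# and a judge that takes the first survivor.
proposers = [lambda q: q + "!", lambda q: q + "?"]
critics = [lambda q, a: a.endswith("?")]
judge = lambda q, survivors: survivors[0]
answer = run_debate("hi", proposers, critics, judge)  # "hi!"
```

The point of the structure is visible even in the toy version: disqualification requires only one honest critic, so the failure mode is collusion among all critics at once, not any single deceptive system.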
Some of these approaches are costly to implement. That might doom debate.
Success with debate likely depends on the complexity of key issues to be settled by debate, and/or the difficulty of empirically checking proposals.
Eliezer sometimes talks as if we’d be stuck evaluating proposals that are way too complex for humans to fully understand. I expect alignment can be achieved by evaluating some relatively simple, high-level principles. I expect we can reject proposals from AI debaters that are too complex, and select simpler proposals until we can understand them fairly well. But I won’t be surprised if we’re still plagued by doubts at the key junctures.