This post points out that <@debate@>(@AI safety via debate@) relies crucially on creating a zero-sum game in order to ensure that the debaters point out flaws in each other’s arguments. For example, if you modified debate so that both agents are penalized for an inconclusive debate, then an agent may decide not to point out a flaw in an argument if it believes that it has some chance of confusing the judge.
Planned summary for the Alignment Newsletter: