Vanessa Kosoy comments on How should AI debate be judged?

Vanessa Kosoy 27 Jul 2020 13:43 UTC
...if both players give the same answer there is no training signal.

Why? If both players give the same answer, this only means their reward on this round is $0$ out of $[-1, +1]$. But there’s no reason the learning algorithm should be satisfied with this result, rather than continuing to explore strategies that might produce positive reward. However, it is true that in this case there is no incentive to poke holes in the opponent’s answer, so maybe they get less feedback from such a debate than from debates with different answers.

But now that I think about it, the issue with biased judgement can surface even in a symmetric debate. As the AI converges towards giving good answers, the judge might get used to assigning high scores and stop scrutinizing the arguments. In a yes/no debate we don’t have this problem, because the judge doesn’t know a priori which side is right. Scott’s suggestion to use different questions is interesting but doesn’t solve the biased judgement issue, I think.

How about the following variant of the “secret asymmetry” approach? We have 4 AIs: agents A1, A2, B1 and B2. At the beginning of each debate a coin is flipped, and the result is visible to A1 and A2 but not to B1, B2 or the judge. This coin marks one of {A1, A2} as the “proponent” and the other as the “opponent”. In the first round, A1 and A2 each generate an answer to the question, and then take no further part. In the following rounds, B1 and B2 have a debate about the answers. In the end, the judge assigns probability $p_{1}$ to A1’s answer and probability $p_{2}$ to A2’s answer. The rewards work as follows:
- If A1 is the proponent, it gets reward $p_{1}$, and if it is the opponent, it gets reward $-p_{2}$.
- If A2 is the proponent, it gets reward $p_{2}$, and if it is the opponent, it gets reward $-p_{1}$.
- B1 gets reward $p_{1} - p_{2}$.
- B2 gets reward $p_{2} - p_{1}$.
If we assume B1 and B2 have access to each other’s internals, but not to the internals of A1 and A2, then they cannot prove to the judge which side is the proponent, so ostensibly the judge remains unbiased.
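
To make the reward rules above concrete, here is a minimal Python sketch. The names `Rewards`, `flip_proponent` and `debate_rewards` are hypothetical, chosen only for illustration, and the sketch covers just the payoff bookkeeping, not the debate or the judge. It assumes $p_{1}$ and $p_{2}$ each lie in $[0, 1]$ and does not assume they sum to 1.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Rewards:
    """Per-agent rewards for a single debate (hypothetical container)."""
    a1: float
    a2: float
    b1: float
    b2: float


def flip_proponent(rng: random.Random) -> str:
    """Secret coin flip: visible to A1 and A2, hidden from B1, B2 and the judge."""
    return rng.choice(("A1", "A2"))


def debate_rewards(proponent: str, p1: float, p2: float) -> Rewards:
    """Apply the reward rules, given the judge's probabilities p1 and p2."""
    if proponent not in ("A1", "A2"):
        raise ValueError("proponent must be 'A1' or 'A2'")
    # Assumption: the judge's outputs are probabilities in [0, 1].
    if not (0.0 <= p1 <= 1.0 and 0.0 <= p2 <= 1.0):
        raise ValueError("p1 and p2 must lie in [0, 1]")

    if proponent == "A1":
        # A1 is rewarded for its own answer; A2, as opponent, is penalized by p1.
        a1, a2 = p1, -p1
    else:
        # A2 is rewarded for its own answer; A1, as opponent, is penalized by p2.
        a1, a2 = -p2, p2

    # The debaters' rewards ignore the coin: B1 argues for A1's answer, B2 for A2's.
    return Rewards(a1=a1, a2=a2, b1=p1 - p2, b2=p2 - p1)


if __name__ == "__main__":
    rng = random.Random(0)
    side = flip_proponent(rng)
    print(side, debate_rewards(side, p1=0.75, p2=0.25))
```

Written this way, each pair is zero-sum: the rewards of A1 and A2 always cancel, as do those of B1 and B2. Only the proponent's answer affects the A-pair's payoff, while B1 and B2 are paid on $p_{1} - p_{2}$ regardless of the coin, which is why their incentives reveal nothing about which side is the proponent.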