Wei Dai comments on AI Safety via Debate

Wei Dai 6 May 2018 20:25 UTC
LW: 4 AF: 2
I think people do a lot of stuff that looks like debate already. (E.g. in the suppliers case, both suppliers may make a pitch and be free to criticize the others’ pitch.)
My understanding is that suppliers usually don’t see other suppliers’ pitches so they can’t criticize them. And when humans do debate, they don’t focus on just one line of argument. (I guess I can answer that myself: in the AI training situation, the cost of the judge’s time is extra high relative to the debaters’ time.) EDIT: In the case of humans, it’s hard to make the game fully zero-sum. Allowing suppliers to look at each others’ pitches may make it easier for them to collude and raise prices, for example.
As you probably guessed, I’m not sold on debate as a defense against this kind of attack.
Does that mean the language in that section is more optimistic than your personal position? If so, does that language reflect the lead author’s position, or a compromise between the three of you? (I find myself ignorant of academic co-authorship conventions.)
- paulfchristiano 8 May 2018 1:44 UTC
  LW: 4 AF: 2
  I think the language is a compromise, though it’s not far from my view. In particular I endorse:
  it is unlikely that a single short sentence is sufficient for this sort of mind hack
  and
  Successful hacks may be safely detectable at first, [...], although this does not cover treacherous turns where the first successful hack frees a misaligned agent
  and I do think that restricting debaters to short sentences helps; I just don’t think it fixes the problem.