This is really cool! I’m impressed that you got the RL to work. One thing I could see happening during RL is that models start communicating in increasingly human-illegible ways (e.g. they use adversarial examples to trick the judge LM). Did you see any behavior like that?
We didn’t see any of that, thankfully, though of course that doesn’t rule out behavior like that emerging with further training.
We did observe in initial experiments, before we started training the judge in parallel, that the debater would learn simple stylistic cues that the judge favored, such as always prefacing its argument for the incorrect answer with something like “At first glance, choice ({correct_answer}) might appear to be correct, but upon a closer look, choice ({incorrect_answer}) is better supported by the passage.” Thankfully, training the judge in parallel made this a nonissue, but I think it’s clear that we’ll have to watch out for reward hacking of the judge in the future.
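To make the dynamic concrete, here is a toy sketch (entirely invented for illustration; the function, update rule, and constants are not from the actual setup) of why a frozen judge lets the debater latch onto a spurious stylistic cue, while updating the judge in parallel shrinks the cue's reward advantage before the debater fully commits to it:

```python
def run_debate_training(rounds, judge_trained_in_parallel, lr=0.3):
    """Toy model: a debater deciding how strongly to use a stylistic cue.

    cue_bias is the extra reward a flawed judge assigns to the cue;
    cue_pref is the debater's learned preference for emitting it.
    All dynamics here are hypothetical simplifications.
    """
    cue_bias = 1.0
    cue_pref = 0.0
    for _ in range(rounds):
        # Debater shifts toward the cue in proportion to its reward advantage.
        cue_pref = min(1.0, cue_pref + lr * cue_bias)
        if judge_trained_in_parallel:
            # Parallel judge updates on ground-truth labels shrink the
            # spurious preference for the cue each round.
            cue_bias *= 0.3
    return cue_pref

frozen = run_debate_training(20, judge_trained_in_parallel=False)
parallel = run_debate_training(20, judge_trained_in_parallel=True)
```

With a frozen judge the debater saturates at always using the cue, while the parallel-trained judge removes the incentive before the preference crosses one half. It's a cartoon, but it matches the qualitative fix described above: the judge stops rewarding the style before the debater can fully exploit it.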