This is really cool! I’m impressed that you got the RL to work. One thing I could see happening during RL is that models start communicating in increasingly human-illegible ways (e.g. they use adversarial examples to trick the judge LM). Did you see any behavior like that?
We didn’t see any of that, thankfully, though of course that doesn’t rule out behavior like that emerging with further training.
We did observe in initial experiments, before we started training the judge in parallel, that the debater would learn simple stylistic cues that the judge favored, such as always prefacing its argument for the incorrect answer with something like “At first glance, choice ({correct_answer}) might appear to be correct, but upon a closer look, choice ({incorrect_answer}) is better supported by the passage.” Thankfully, training the judge in parallel made this a nonissue, but I think it’s clear that we’ll have to watch out for reward hacking of the judge in the future.
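To make the dynamic concrete, here is a toy sketch (entirely invented for illustration; the function, update rule, and constants are not from the actual setup) of why a frozen judge lets the debater latch onto a spurious stylistic cue, while updating the judge in parallel shrinks the cue's reward advantage before the debater fully commits to it:

```python
def run_debate_training(rounds, judge_trained_in_parallel, lr=0.3):
    """Toy model: a debater deciding how strongly to use a stylistic cue.

    cue_bias is the extra reward a flawed judge assigns to the cue;
    cue_pref is the debater's learned preference for emitting it.
    All dynamics here are hypothetical simplifications.
    """
    cue_bias = 1.0
    cue_pref = 0.0
    for _ in range(rounds):
        # Debater shifts toward the cue in proportion to its reward advantage.
        cue_pref = min(1.0, cue_pref + lr * cue_bias)
        if judge_trained_in_parallel:
            # Parallel judge updates on ground-truth labels shrink the
            # spurious preference for the cue each round.
            cue_bias *= 0.3
    return cue_pref

frozen = run_debate_training(20, judge_trained_in_parallel=False)
parallel = run_debate_training(20, judge_trained_in_parallel=True)
```

With a frozen judge the debater saturates at always using the cue, while the parallel-trained judge removes the incentive before the preference crosses one half. It's a cartoon, but it matches the qualitative fix described above: the judge stops rewarding the style before the debater can fully exploit it.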