In this section, you describe what seems at first glance to be an example of a model playing the training game and/or optimizing for reward. I’m curious if you agree with that assessment.
So the model learns to behave in ways that it thinks the RM will reinforce, not just in ways the RM actually reinforces. Right? That seems at least fairly conceptually similar to playing the training game, and at least some evidence that reward can sometimes become the optimization target?
Awesome work!
FYI, see this paper for a more detailed explanation of the model in question and how it's trained.