I think it's possible that fine-tuning on ratings of the model's own outputs was done in a way equivalent to an RL step, with effective reward such that it makes sense why it would converge on stuff like "97 is the most random number."
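To spell out one way the "equivalent to an RL step" reading could hold: if the fine-tuning loss on a model-generated sequence is just its log-likelihood weighted by the rating, the gradient is the same as a single REINFORCE policy-gradient step with the rating as reward. A minimal PyTorch sketch, assuming an HF-style causal LM interface and hypothetical names (`sequence_ids`, `rating`, `optimizer`):

```python
import torch
import torch.nn.functional as F

def rated_finetune_step(model, sequence_ids, rating, optimizer):
    """One update on a model-generated sequence, weighted by its rating.

    Assumes an HF-style causal LM (`model(ids).logits` has shape
    (batch, seq_len, vocab)); `sequence_ids` is a sequence the model itself
    generated, and `rating` is a scalar score a rater gave that sequence.
    """
    logits = model(sequence_ids).logits[:, :-1, :]   # predict token t+1 from each prefix
    targets = sequence_ids[:, 1:]
    token_logp = -F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    seq_logp = token_logp.sum()                      # log pi(sequence)

    # Rating-weighted NLL: the gradient is -rating * grad(log pi(sequence)),
    # i.e. the same update as one REINFORCE step with reward = rating
    # (no baseline, no KL penalty).
    loss = -rating * seq_logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```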
I guess that's possible, but I don't see any particular reason to entertain that hypothesis. (Other than wanting to rescue our original hunch that RL is what's causing this issue.) I'm a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.
I mean, they both output lists of numbers. But interpreting those numbers as a probability distribution makes a lot more sense when you're training something with a proper scoring rule.
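Concretely, the sense in which log loss is a proper scoring rule: the expected loss under the true distribution p decomposes into entropy plus a KL term, so it is minimized only by reporting q = p. That's why outputs trained this way really are an estimate of a distribution, rather than just numbers that happen to sum to 1:

$$\mathbb{E}_{x\sim p}\big[-\log q(x)\big] \;=\; H(p) + D_{\mathrm{KL}}(p\,\|\,q) \;\ge\; H(p), \qquad \text{with equality iff } q = p.$$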
I agree that with the pure self-supervised case, it's clearer which distribution we're approximating. But in both cases, the network really does output a probability distribution. GPT outputs a probability distribution over next tokens conditioned on the current context. A policy function outputs a probability distribution over actions conditioned on the current state. In the case of InstructGPT, those actions are next-tokens and the current state is the current context.
Agreed that we can interpret these things as navigating a world of text. It is helpful IMO to realize that under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing "agents".
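A sketch of that identification (assuming an HF-style causal LM and hypothetical names): the generic behavioral-cloning loss, negative log-probability of the expert's action given the state, becomes the ordinary next-token cross-entropy once "state" = the context so far and "expert action" = the next token the corpus actually produced. PPO differs mainly in weighting updates by an estimated advantage instead of always imitating.

```python
import torch.nn.functional as F

def bc_loss(policy, states, expert_actions):
    """Generic behavioral cloning: maximize log pi(expert action | state).

    Hypothetical interface: `policy(states)` returns logits over actions.
    """
    return F.cross_entropy(policy(states), expert_actions)

def lm_loss(model, token_ids):
    """GPT pretraining written as BC on Internet "trajectories".

    Assumes an HF-style causal LM; state = the prefix of tokens,
    expert action = the next token the corpus (the "expert policy") chose.
    """
    logits = model(token_ids).logits[:, :-1, :]   # pi(action | state) for each prefix
    actions = token_ids[:, 1:]                    # the expert's chosen next tokens
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        actions.reshape(-1),
    )
```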
I'm a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.
Yeah, I looked into it, and the details they released recently are super sparse. But I did see one quote to the effect that it was already overfitting after 1 epoch, yet they kept training it for some larger number of epochs because the test score kept improving.
under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing "agents".
I think this is a different interpretation, actually, because it's a different division of what's "environment" and what's "agent." Talking about base GPT as doing behavior cloning means taking the "environment" to be a passive recorder of the output. In such an environment, everything is easily interpreted as an agent (whose "goal" is producing whatever output it produced), but the tradeoff is the "agent" abstraction isn't helping you compress your data.