The closest we have is the text-davinci-002 “FeedMe” model, where they train the model to imitate the distribution of highly-rated samples. That model seems to produce mode collapse all the same, so much so that we mistakenly attributed the phenomenon to RLHF.
I don’t actually understand the text-davinci-002 training method, so I’ll have to look into it more later—thanks for the push! I think it’s possible that fine-tuning on ratings of the model’s own outputs was done in a way equivalent to an RL step, with effective reward such that it makes sense why it would converge on stuff like “97 is the most random number.”
I would say the closest we have is people finetuning language models on more narrow decision transformer tasks, like chess. But I admit I’ve never seen anyone check for mode collapse in such a case, which now seems like a useful thing to check.
In both cases you get a model that outputs a shaped categorical distribution over next tokens that you can pick from however you’d like (for ex. via top-p, which is a quantilization-flavored sampling algorithm).
I mean, they both output lists of numbers. But interpreting those numbers as a probability distribution makes a lot more sense when you’re training something with a proper scoring rule.
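For concreteness, here's a minimal sketch of the top-p sampling mentioned above (a quantilization-flavored way of picking from whatever categorical distribution the model outputs); the probabilities below are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)

def top_p_sample(probs, p=0.9):
    # Keep the smallest set of highest-probability tokens whose total mass
    # reaches p, renormalize within that set, and sample from it.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    kept = order[:cutoff]
    return rng.choice(kept, p=probs[kept] / probs[kept].sum())

probs = np.array([0.42, 0.30, 0.15, 0.08, 0.05])  # toy next-token distribution
print(top_p_sample(probs, p=0.9))                 # index of the sampled token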
Here's a very slightly more specific way of thinking about what I mean by "agent acting in a text-universe" (which is related to how I think of the output type as something other than a probability distribution). When you have a KL loss term relative to the prior distribution, that gives you a "budget" that you get to spend on some function f(a|s) that multiplies π0(a|s). The more f deviates from 1, the more it costs (like log(f)). And the RLHF loss is doing something like trying to get a good score while staying under budget on f.
This suggests that you can think of [log(f(a|s)) for a] as the policy of an agent—i.e. how “it” wants to spend its budget in each state—rather than f⋅π0, which is what the whole system actually outputs at each state. This is what I think of colorfully as “an agent living in a text-universe,” where the “text-universe” is the dynamics π0 that the “agent” has only limited ability to push around.
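As a toy numerical version of that budget picture, assuming the usual KL-penalized objective (expected reward minus β times the KL divergence from π0; the rewards, β, and π0 below are invented for illustration):

import numpy as np

pi0 = np.array([0.5, 0.3, 0.15, 0.05])   # base model's next-token distribution in some state
r = np.array([0.0, 1.0, 0.2, -0.5])      # hypothetical per-token reward in that state
beta = 0.5                               # strength of the KL penalty

log_f = np.array([0.0, 1.2, 0.0, -2.0])  # the "spend": how hard to push each token up or down
pi = pi0 * np.exp(log_f)
pi = pi / pi.sum()                       # what the whole system actually outputs (f·π0, renormalized)

kl_cost = np.sum(pi * np.log(pi / pi0))  # budget used: KL from the prior
objective = np.sum(pi * r) - beta * kl_cost
print(np.round(pi, 3), float(kl_cost), float(objective))

Pushing log_f further from 0 buys more probability mass on high-reward tokens but uses up more of the KL budget, which is the tradeoff the RLHF loss is navigating.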
Also not sure in what sense you meant that "RLHF is trying to get the highest score possible". This seems like it's false.
Thanks for this great comment.

And re the last point: yeah, this was sloppy.

I think it's possible that fine-tuning on ratings of the model's own outputs was done in a way equivalent to an RL step, with effective reward such that it makes sense why it would converge on stuff like "97 is the most random number."
I guess that’s possible, but I don’t see any particular reason to entertain that hypothesis. (Other than wanting to rescue our original hunch that RL is what’s causing this issue 😛). I’m a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.
I mean, they both output lists of numbers. But interpreting those numbers as a probability distribution makes a lot more sense when you’re training something with a proper scoring rule.
I agree that with the pure self-supervised case, it's clearer which distribution we're approximating. But in both cases, the network really does output a probability distribution. GPT outputs a probability distribution over next tokens conditioned on the current context. A policy function outputs a probability distribution over actions conditioned on the current state. In the case of InstructGPT, those actions are next tokens and the current state is the current context.
Agreed that we can interpret these things as navigating a world of text. It is helpful IMO to realize that under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing “agents”.
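To make the behavioral cloning point concrete, here's a minimal sketch (assuming a generic autoregressive model that outputs logits; not any particular library's training code): the BC loss on "expert" trajectories from the Internet is just the usual next-token cross-entropy.

import torch
import torch.nn.functional as F

def bc_loss(logits, expert_tokens):
    # logits: [batch, seq, vocab] from any autoregressive model (the "policy")
    # expert_tokens: [batch, seq] tokens the "expert" (the Internet) actually produced
    # Maximizing the log-probability of the expert's actions is exactly the
    # next-token cross-entropy objective used for GPT pretraining.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           expert_tokens.reshape(-1))

logits = torch.randn(2, 5, 100)                   # toy batch: 2 contexts, 5 steps, vocab of 100
expert_tokens = torch.randint(0, 100, (2, 5))
print(bc_loss(logits, expert_tokens))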
I’m a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.
Yeah, after looking into it, the details they released recently are super sparse. But I did see one quote to the effect that it was already overfitting after 1 epoch, yet they kept training it for some larger number of epochs because the test score kept improving.
under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing “agents”.
I think this is a different interpretation, actually, because it's a different division of what's "environment" and what's "agent." Talking about base GPT as doing behavior cloning means taking the "environment" to be a passive recorder of the output. In such an environment, everything is easily interpreted as an agent (whose "goal" is producing whatever output it produced), but the tradeoff is that the "agent" abstraction isn't helping you compress your data.