This was definitely a hot take. 🔥

I think this might be widely believed and also wrong, or at least not well-supported. What makes training with RL(HF) inherently produce "agents" in a way that finetuning on high-capability trajectories, or pretraining on datasets from human agents, does not? IMO the community would think more clearly if we just talked about policies and policy optimization algorithms (of which teacher-forced finetuning, decision-transformer stuff, PPO, etc. are examples) directly rather than leaning on our confused understanding of the agency concept.
The difference is stark in their reactions to variance. RLHF wants to eliminate variance that might make a material difference in the trajectory (when the KL penalty is small relative to the Bayesian-updating KL penalty), while conditioning on rating still tries to produce something that looks like the training distribution.
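(To spell out the standard closed form behind this contrast, which isn't stated above: with reward r and KL coefficient β, the optimum of E_π[r] − β·KL(π‖π0) is π*(x) ∝ π0(x)·exp(r(x)/β). As β → 0 this concentrates on the argmax of r, i.e. the variance is gone, whereas conditioning on a rating R gives the posterior p(x|R) ∝ π0(x)·p(R|x), which is a reweighted slice of the training distribution rather than a point mass.)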
Why is it that you think that decision transformer-style conditioning preserves variance better than PPO on a reward model? I don't recall seeing any LLMs that use that method. The closest we have is the text-davinci-002 "FeedMe" model, where they train the model to imitate the distribution of highly-rated samples. That model seems to produce mode collapse all the same, so much so that we mistakenly attributed the phenomenon to RLHF. If what you say is true, we should see stark differences in variability/creativity between text-davinci-002 and text-davinci-003, no?
Another benefit is quantilization. RLHF is trying to get the highest score available, even if it means exploiting human biases. If instead you condition on a score that's high but still regularly gotten by humans, it's like you're sampling policies that get this high-but-not-too-high score, which are less exploitative of human raters than the absolute maximum-score policy.
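Here's a toy sketch of that sampling picture; the trajectories and scores below are made up, not from any real system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins: 10k "trajectories" with i.i.d. normal scores.
trajectories = [f"traj_{i}" for i in range(10_000)]
scores = rng.normal(size=len(trajectories))

target = np.quantile(scores, 0.95)  # high, but still regularly achieved
tolerance = 0.05

# Conditioning on "score ~= target" samples uniformly from the slice of the
# prior that achieves that score, instead of returning the single argmax.
pool = [t for t, s in zip(trajectories, scores) if abs(s - target) < tolerance]
conditioned_sample = pool[rng.integers(len(pool))]

maximizer = trajectories[int(np.argmax(scores))]  # what pure score-maximization picks
print(conditioned_sample, maximizer)
```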
I disagree. In both cases you get a model that outputs a shaped categorical distribution over next tokens that you can pick from however you'd like (for ex. via top-p, which is a quantilization-flavored sampling algorithm). Also not sure in what sense you meant that "RLHF is trying to get the highest score possible". This seems like it's false in reference to the actual policies found by the algorithm (reward is not the optimization target yadda yadda), questionable in reference to the learning algorithm absent something like a convergence guarantee, and false in reference to the overall humans-training-a-model process (i.e. researchers in practice check to see whether the model is just producing bad samples that happen to trick the reward model).
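For concreteness, here's roughly what I mean by top-p, as a minimal numpy sketch (not any particular library's implementation):

```python
import numpy as np

def top_p_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    probabilities sum to at least p, renormalize, and sample from it."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # tokens, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest nucleus covering p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```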
Thanks for this great comment.

The closest we have is the text-davinci-002 "FeedMe" model, where they train the model to imitate the distribution of highly-rated samples. That model seems to produce mode collapse all the same, so much so that we mistakenly attributed the phenomenon to RLHF.
I don't actually understand the text-davinci-002 training method, so I'll have to look into it more later (thanks for the push!). I think it's possible that fine-tuning on ratings of the model's own outputs was done in a way equivalent to an RL step, with effective reward such that it makes sense why it would converge on stuff like "97 is the most random number."
I would say the closest we have is people finetuning language models on narrower decision transformer tasks, like chess. But I admit I've never seen anyone check for mode collapse in such a case, which now seems like a useful thing to check.
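A first-pass check might look like this; `generate` is a hypothetical stand-in for whatever sampling interface the finetuned model exposes:

```python
from collections import Counter
import math

def sample_entropy(samples: list[str]) -> float:
    """Empirical entropy (in bits) of a batch of sampled outputs.
    Near-zero entropy on an open-ended prompt is the mode-collapse signature."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# `generate` is hypothetical; the check itself is model-agnostic.
# samples = [generate("Play a random legal opening move.") for _ in range(200)]
# print(sample_entropy(samples))  # compare base model vs finetuned model
```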
In both cases you get a model that outputs a shaped categorical distribution over next tokens that you can pick from however you'd like (for ex. via top-p, which is a quantilization-flavored sampling algorithm).
I mean, they both output lists of numbers. But interpreting those numbers as a probability distribution makes a lot more sense when you're training something with a proper scoring rule.
Here's a very slightly more specific way of thinking about what I mean by "agent acting in a text-universe" (which is related to how I think of the output type as something other than a probability distribution). When you have a KL loss term relative to the prior distribution, that gives you a "budget" that you get to spend on some function f(a|s) that multiplies π0(a|s). The more f deviates from 1, the more it costs (like log(f)). And the RLHF loss is doing something like trying to get a good score while staying under-budget on f.
This suggests that you can think of [log(f(a|s)) for a] as the policy of an agent, i.e. how "it" wants to spend its budget in each state, rather than f·π0, which is what the whole system actually outputs at each state. This is what I think of colorfully as "an agent living in a text-universe," where the "text-universe" is the dynamics π0 that the "agent" has only limited ability to push around.
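(Slightly more formally, under the usual one-step KL-regularized setup: if the learned policy is π = f·π0, with f normalized so that π sums to 1, then KL(π‖π0) = E_{a~π}[log(f(a|s))]. So the KL term literally meters the expected spend on log(f), and the RLHF objective in constrained form is "maximize E_π[r] subject to E_π[log(f)] ≤ B" for some budget B. In that one-step case the optimal spend comes out to f(a|s) ∝ exp(r(s,a)/β).)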
Also not sure in what sense you meant that "RLHF is trying to get the highest score possible". This seems like it's false

Yeah, this was sloppy.
I think it's possible that fine-tuning on ratings of the model's own outputs was done in a way equivalent to an RL step, with effective reward such that it makes sense why it would converge on stuff like "97 is the most random number."
I guess that's possible, but I don't see any particular reason to entertain that hypothesis. (Other than wanting to rescue our original hunch that RL is what's causing this issue 🙂). I'm a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.
I mean, they both output lists of numbers. But interpreting those numbers as a probability distribution makes a lot more sense when you're training something with a proper scoring rule.
I agree that with the pure self-supervised case, it's clearer which distribution we're approximating. But in both cases, the network really does output a probability distribution. GPT outputs a probability distribution over next tokens conditioned on the current context. A policy function outputs a probability distribution over actions conditioned on the current state. In the case of InstructGPT, those actions are next-tokens and the current state is the current context.
Agreed that we can interpret these things as navigating a world of text. It is helpful IMO to realize that under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing "agents".
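To make the identification concrete: the teacher-forced LM loss is exactly the BC objective, with the "expert action" at each state being the next token the corpus actually contains. A minimal PyTorch-style sketch, where `model` is any callable from token ids to logits (hypothetical, not any specific codebase):

```python
import torch
import torch.nn.functional as F

def behavioral_cloning_step(model, tokens: torch.Tensor) -> torch.Tensor:
    """One teacher-forced step: states are contexts, expert actions are the
    next tokens the training corpus actually contains."""
    logits = model(tokens[:, :-1])            # (batch, seq-1, vocab): policy logits per state
    expert_actions = tokens[:, 1:]            # (batch, seq-1): demonstrated actions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten states
        expert_actions.reshape(-1),           # flatten actions
    )
```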
I'm a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.
Yeah, after looking into it, the details they released recently are super sparse. I did see one quote to the effect that it was already overfitting after 1 epoch, but they kept training it for some larger number of epochs because the test score kept improving.
under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing "agents".
I think this is a different interpretation, actually, because it's a different division of what's "environment" and what's "agent." Talking about base GPT as doing behavior cloning means taking the "environment" to be a passive recorder of the output. In such an environment, everything is easily interpreted as an agent (whose "goal" is producing whatever output it produced), but the tradeoff is the "agent" abstraction isn't helping you compress your data.