This was definitely a hot take. 🔥

I think this might be widely believed and also wrong, or at least not well-supported. What makes training with RL(HF) inherently produce "agents" in a way that finetuning on high-capability trajectories, or pretraining on datasets from human agents, does not? IMO the community would think more clearly if we just talked about policies and policy optimization algorithms (of which teacher-forced finetuning, decision-transformer stuff, PPO, etc. are examples) directly rather than leaning on our confused understanding of the agency concept.
The difference is stark in their reactions to variance. RLHF wants to eliminate variance that might make a material difference in the trajectory (when the KL penalty is small relative to the Bayesian-updating KL penalty), while conditioning on rating still tries to produce something that looks like the training distribution.
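(To spell out the standard closed form behind this contrast, which isn't stated above: with reward r and KL coefficient β, the optimum of E_π[r] − β·KL(π‖π0) is π*(x) ∝ π0(x)·exp(r(x)/β). As β → 0 this concentrates on the argmax of r, i.e. the variance is gone, whereas conditioning on a rating R gives the posterior p(x|R) ∝ π0(x)·p(R|x), which is a reweighted slice of the training distribution rather than a point mass.)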
Why is it that you think that decision transformer-style conditioning preserves variance better than PPO on a reward model? I don't recall seeing any LLMs that use that method. The closest we have is the text-davinci-002 "FeedMe" model, where they train the model to imitate the distribution of highly-rated samples. That model seems to produce mode collapse all the same, so much so that we mistakenly attributed the phenomenon to RLHF. If what you say is true, we should see stark differences in variability/creativity between text-davinci-002 and text-davinci-003, no?
Another benefit is quantilization. RLHF is trying to get the highest score available, even if it means exploiting human biases. If instead you condition on a score that's high but still regularly gotten by humans, it's like you're sampling policies that get this high-but-not-too-high score, which are less exploitative of human raters than the absolute maximum-score policy.
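Here's a toy sketch of that sampling picture; the trajectories and scores below are made up, not from any real system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins: 10k "trajectories" with i.i.d. normal scores.
trajectories = [f"traj_{i}" for i in range(10_000)]
scores = rng.normal(size=len(trajectories))

target = np.quantile(scores, 0.95)  # high, but still regularly achieved
tolerance = 0.05

# Conditioning on "score ~= target" samples uniformly from the slice of the
# prior that achieves that score, instead of returning the single argmax.
pool = [t for t, s in zip(trajectories, scores) if abs(s - target) < tolerance]
conditioned_sample = pool[rng.integers(len(pool))]

maximizer = trajectories[int(np.argmax(scores))]  # what pure score-maximization picks
print(conditioned_sample, maximizer)
```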
I disagree. In both cases you get a model that outputs a shaped categorical distribution over next tokens that you can pick from however you'd like (for ex. via top-p, which is a quantilization-flavored sampling algorithm). Also not sure in what sense you meant that "RLHF is trying to get the highest score possible". This seems like it's false in reference to the actual policies found by the algorithm (reward is not the optimization target yadda yadda), questionable in reference to the learning algorithm absent something like a convergence guarantee, and false in reference to the overall humans-training-a-model process (i.e. researchers in practice check to see whether the model is just producing bad samples that happen to trick the reward model).
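For concreteness, here's roughly what I mean by top-p, as a minimal numpy sketch (not any particular library's implementation):

```python
import numpy as np

def top_p_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    probabilities sum to at least p, renormalize, and sample from it."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # tokens, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest nucleus covering p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```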
Thanks for this great comment.

The closest we have is the text-davinci-002 "FeedMe" model, where they train the model to imitate the distribution of highly-rated samples. That model seems to produce mode collapse all the same, so much so that we mistakenly attributed the phenomenon to RLHF.
I don't actually understand the text-davinci-002 training method, so I'll have to look into it more later (thanks for the push!). I think it's possible that fine-tuning on ratings of the model's own outputs was done in a way equivalent to an RL step, with effective reward such that it makes sense why it would converge on stuff like "97 is the most random number."
I would say the closest we have is people finetuning language models on narrower decision transformer tasks, like chess. But I admit I've never seen anyone check for mode collapse in such a case, which now seems like a useful thing to check.
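A first-pass check might look like this; `generate` is a hypothetical stand-in for whatever sampling interface the finetuned model exposes:

```python
from collections import Counter
import math

def sample_entropy(samples: list[str]) -> float:
    """Empirical entropy (in bits) of a batch of sampled outputs.
    Near-zero entropy on an open-ended prompt is the mode-collapse signature."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# `generate` is hypothetical; the check itself is model-agnostic.
# samples = [generate("Play a random legal opening move.") for _ in range(200)]
# print(sample_entropy(samples))  # compare base model vs finetuned model
```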
In both cases you get a model that outputs a shaped categorical distribution over next tokens that you can pick from however you'd like (for ex. via top-p, which is a quantilization-flavored sampling algorithm).
I mean, they both output lists of numbers. But interpreting those numbers as a probability distribution makes a lot more sense when you're training something with a proper scoring rule.
Here's a very slightly more specific way of thinking about what I mean by "agent acting in a text-universe" (which is related to how I think of the output type as something other than a probability distribution). When you have a KL loss term relative to the prior distribution, that gives you a "budget" that you get to spend on some function f(a|s) that multiplies π0(a|s). The more f deviates from 1, the more it costs (like log(f)). And the RLHF loss is doing something like trying to get a good score while staying under-budget on f.
This suggests that you can think of [log(f(a|s)) for a] as the policy of an agent, i.e. how "it" wants to spend its budget in each state, rather than f·π0, which is what the whole system actually outputs at each state. This is what I think of colorfully as "an agent living in a text-universe," where the "text-universe" is the dynamics π0 that the "agent" has only limited ability to push around.
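(Slightly more formally, under the usual one-step KL-regularized setup: if the learned policy is π = f·π0, with f normalized so that π sums to 1, then KL(π‖π0) = E_{a~π}[log(f(a|s))]. So the KL term literally meters the expected spend on log(f), and the RLHF objective in constrained form is "maximize E_π[r] subject to E_π[log(f)] ≤ B" for some budget B. In that one-step case the optimal spend comes out to f(a|s) ∝ exp(r(s,a)/β).)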
Also not sure in what sense you meant that "RLHF is trying to get the highest score possible". This seems like it's false

Yeah, this was sloppy.
I think it's possible that fine-tuning on ratings of the model's own outputs was done in a way equivalent to an RL step, with effective reward such that it makes sense why it would converge on stuff like "97 is the most random number."
I guess that's possible, but I don't see any particular reason to entertain that hypothesis. (Other than wanting to rescue our original hunch that RL is what's causing this issue 🙂). I'm a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.
I mean, they both output lists of numbers. But interpreting those numbers as a probability distribution makes a lot more sense when you're training something with a proper scoring rule.
I agree that with the pure self-supervised case, it's clearer which distribution we're approximating. But in both cases, the network really does output a probability distribution. GPT outputs a probability distribution over next tokens conditioned on the current context. A policy function outputs a probability distribution over actions conditioned on the current state. In the case of InstructGPT, those actions are next-tokens and the current state is the current context.
Agreed that we can interpret these things as navigating a world of text. It is helpful IMO to realize that under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing "agents".
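To make the identification concrete: the teacher-forced LM loss is exactly the BC objective, with the "expert action" at each state being the next token the corpus actually contains. A minimal PyTorch-style sketch, where `model` is any callable from token ids to logits (hypothetical, not any specific codebase):

```python
import torch
import torch.nn.functional as F

def behavioral_cloning_step(model, tokens: torch.Tensor) -> torch.Tensor:
    """One teacher-forced step: states are contexts, expert actions are the
    next tokens the training corpus actually contains."""
    logits = model(tokens[:, :-1])            # (batch, seq-1, vocab): policy logits per state
    expert_actions = tokens[:, 1:]            # (batch, seq-1): demonstrated actions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten states
        expert_actions.reshape(-1),           # flatten actions
    )
```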
I'm a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.
Yeah, after looking into it, the details they released recently are super sparse. I did see one quote to the effect that it was already overfitting after 1 epoch, but they kept training it for some larger number of epochs because the test score kept improving.
under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing "agents".
I think this is a different interpretation, actually, because it's a different division of what's "environment" and what's "agent." Talking about base GPT as doing behavior cloning means taking the "environment" to be a passive recorder of the output. In such an environment, everything is easily interpreted as an agent (whose "goal" is producing whatever output it produced), but the tradeoff is the "agent" abstraction isn't helping you compress your data.