I think it's possible that fine-tuning on ratings of the model's own outputs was done in a way equivalent to an RL step, with effective reward such that it makes sense why it would converge on stuff like "97 is the most random number."
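To spell out one way the "equivalent to an RL step" reading could hold: if the fine-tuning loss on a model-generated sequence is just its log-likelihood weighted by the rating, the gradient is the same as a single REINFORCE policy-gradient step with the rating as reward. A minimal PyTorch sketch, assuming an HF-style causal LM interface and hypothetical names (`sequence_ids`, `rating`, `optimizer`):

```python
import torch
import torch.nn.functional as F

def rated_finetune_step(model, sequence_ids, rating, optimizer):
    """One update on a model-generated sequence, weighted by its rating.

    Assumes an HF-style causal LM (`model(ids).logits` has shape
    (batch, seq_len, vocab)); `sequence_ids` is a sequence the model itself
    generated, and `rating` is a scalar score a rater gave that sequence.
    """
    logits = model(sequence_ids).logits[:, :-1, :]   # predict token t+1 from each prefix
    targets = sequence_ids[:, 1:]
    token_logp = -F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    seq_logp = token_logp.sum()                      # log pi(sequence)

    # Rating-weighted NLL: the gradient is -rating * grad(log pi(sequence)),
    # i.e. the same update as one REINFORCE step with reward = rating
    # (no baseline, no KL penalty).
    loss = -rating * seq_logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```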
I guess that's possible, but I don't see any particular reason to entertain that hypothesis. (Other than wanting to rescue our original hunch that RL is what's causing this issue.) I'm a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.
I mean, they both output lists of numbers. But interpreting those numbers as a probability distribution makes a lot more sense when you're training something with a proper scoring rule.
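Concretely, the sense in which log loss is a proper scoring rule: the expected loss under the true distribution p decomposes into entropy plus a KL term, so it is minimized only by reporting q = p. That's why outputs trained this way really are an estimate of a distribution, rather than just numbers that happen to sum to 1:

$$\mathbb{E}_{x\sim p}\big[-\log q(x)\big] \;=\; H(p) + D_{\mathrm{KL}}(p\,\|\,q) \;\ge\; H(p), \qquad \text{with equality iff } q = p.$$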
I agree that with the pure self-supervised case, it's clearer which distribution we're approximating. But in both cases, the network really does output a probability distribution. GPT outputs a probability distribution over next tokens conditioned on the current context. A policy function outputs a probability distribution over actions conditioned on the current state. In the case of InstructGPT, those actions are next-tokens and the current state is the current context.
Agreed that we can interpret these things as navigating a world of text. It is helpful IMO to realize that under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing "agents".
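A sketch of that identification (assuming an HF-style causal LM and hypothetical names): the generic behavioral-cloning loss, negative log-probability of the expert's action given the state, becomes the ordinary next-token cross-entropy once "state" = the context so far and "expert action" = the next token the corpus actually produced. PPO differs mainly in weighting updates by an estimated advantage instead of always imitating.

```python
import torch.nn.functional as F

def bc_loss(policy, states, expert_actions):
    """Generic behavioral cloning: maximize log pi(expert action | state).

    Hypothetical interface: `policy(states)` returns logits over actions.
    """
    return F.cross_entropy(policy(states), expert_actions)

def lm_loss(model, token_ids):
    """GPT pretraining written as BC on Internet "trajectories".

    Assumes an HF-style causal LM; state = the prefix of tokens,
    expert action = the next token the corpus (the "expert policy") chose.
    """
    logits = model(token_ids).logits[:, :-1, :]   # pi(action | state) for each prefix
    actions = token_ids[:, 1:]                    # the expert's chosen next tokens
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        actions.reshape(-1),
    )
```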
I'm a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.
Yeah, I looked into it, and the details they released recently are super sparse. But I did see one quote to the effect that it was already overfitting after 1 epoch, yet they kept training it for some larger number of epochs because the test score kept improving.
under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing "agents".
I think this is a different interpretation, actually, because it's a different division of what's "environment" and what's "agent." Talking about base GPT as doing behavior cloning means taking the "environment" to be a passive recorder of the output. In such an environment, everything is easily interpreted as an agent (whose "goal" is producing whatever output it produced), but the tradeoff is the "agent" abstraction isn't helping you compress your data.