In this thread, I asked Jan Leike what kind of model generates the samples that go into the training data if rated 7⁄7, and he answered “A mix of previously trained models. Probably very few samples from base models if any” (emphasis mine).
I’m curious to know whether/which of the behaviors described in this post appear in the models that generated the samples vs emerge at the supervised finetuning step.
Hypothetically, if a model trained with RLHF generates the samples and that model has the same modes/attractors, it probably makes sense to say that RLHF was responsible for shaping those behaviors and finetuning only “cloned” them.
(Note that it wasn’t specified how the previously trained non-base models actually were trained, leaving open the possibilities of RLHF models, models fine tuned on only human data, earlier iterations of FeedME models, or something entirely different.)
Yeah, I agree with that assessment. One important difference in RLHF vs fine-tuning is that the former basically generates the training distribution it then trains on. So, the LM will generate an output, and update its gradients based on the reward of that output. So intuitively I think it has a higher likelihood to be optimized towards certain unwanted attractors since the reward model will shape the future outputs it then learns from.
With fine-tuning you are just cloning a fixed distribution, and not influencing it (as you say). So I tend to agree that probably unwanted attractors could likely be due to the outputs of RLHF-trained models. I think that we need empirical evidence for this though (to be certain).
Given your statement, I also think that doing those experiments with GPT-3 models is gonna be hard because we basically have no way of telling what data it learned from, how it was generated, etc. So one would need to be more scientific and train various models with various optimization schemes, on known data distributions.
In this thread, I asked Jan Leike what kind of model generates the samples that go into the training data if rated 7⁄7, and he answered “A mix of previously trained models. Probably very few samples from base models if any” (emphasis mine).
I’m curious to know whether/which of the behaviors described in this post appear in the models that generated the samples vs emerge at the supervised finetuning step.
Hypothetically, if a model trained with RLHF generates the samples and that model has the same modes/attractors, it probably makes sense to say that RLHF was responsible for shaping those behaviors and finetuning only “cloned” them.
(Note that it wasn’t specified how the previously trained non-base models actually were trained, leaving open the possibilities of RLHF models, models fine tuned on only human data, earlier iterations of FeedME models, or something entirely different.)
That’s interesting!
Yeah, I agree with that assessment. One important difference in RLHF vs fine-tuning is that the former basically generates the training distribution it then trains on. So, the LM will generate an output, and update its gradients based on the reward of that output. So intuitively I think it has a higher likelihood to be optimized towards certain unwanted attractors since the reward model will shape the future outputs it then learns from.
With fine-tuning you are just cloning a fixed distribution, and not influencing it (as you say). So I tend to agree that probably unwanted attractors could likely be due to the outputs of RLHF-trained models. I think that we need empirical evidence for this though (to be certain).
Given your statement, I also think that doing those experiments with GPT-3 models is gonna be hard because we basically have no way of telling what data it learned from, how it was generated, etc. So one would need to be more scientific and train various models with various optimization schemes, on known data distributions.