Bravo, I’ve been wondering if this was possible for awhile now—since RLHF came into common use and there have been more concerns around it. Your results seem encouraging!
PHF seems expensive to implement. Finetuning a model seems a lot easier/cheaper than sculpting and tagging an entire training corpus and training a model from scratch. Maybe there is some practical workflow of internally prototyping models using finetuning, and then once you’ve honed your reward model and done a lot of testing, using PHF to train a safer/more robust version of the model.
That’s a good point. But if you’re using a distilled, inference-bandwith-optimised RM, annotating your training data might be a fraction of compute needed for pretraining.
Also, the cost of annotation is constant and can be amortized over many training runs. PHF shares an important advantage of offline RL over online RL approaches (such as RLHF): being able to reuse feedback annotations across experiments. If you already have a dataset, running a hyperparameter sweep on it is as cheap as standard pretraining and in contrast with RLHF you don’t need to recompute rewards.
Bravo, I’ve been wondering if this was possible for awhile now—since RLHF came into common use and there have been more concerns around it. Your results seem encouraging!
PHF seems expensive to implement. Finetuning a model seems a lot easier/cheaper than sculpting and tagging an entire training corpus and training a model from scratch. Maybe there is some practical workflow of internally prototyping models using finetuning, and then once you’ve honed your reward model and done a lot of testing, using PHF to train a safer/more robust version of the model.
That’s a good point. But if you’re using a distilled, inference-bandwith-optimised RM, annotating your training data might be a fraction of compute needed for pretraining.
Also, the cost of annotation is constant and can be amortized over many training runs. PHF shares an important advantage of offline RL over online RL approaches (such as RLHF): being able to reuse feedback annotations across experiments. If you already have a dataset, running a hyperparameter sweep on it is as cheap as standard pretraining and in contrast with RLHF you don’t need to recompute rewards.