I agree. I don’t see a clear distinction between what’s in the model’s predictive model and what’s in the model’s preferences. Here is a line from the paper “Learning to summarize from human feedback”:
“To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary y ∈ {y0, y1} is better as judged by a human, given a post x.”
Since the reward model is initialized from the supervised baseline, which is itself fine-tuned from the pretrained language model, it should contain everything the pretrained language model knows.
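For concreteness, here is a minimal sketch of the setup that quoted passage describes: a pretrained transformer body with a randomly initialized linear head producing a scalar, trained with a pairwise comparison loss. This is my own illustration in PyTorch with Hugging Face transformers, not the paper's actual code; the model name ("gpt2"), class names, and pooling choice are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel  # library choice is an assumption, not the paper's


class RewardModel(nn.Module):
    """Pretrained LM body plus a randomly initialized scalar head,
    mirroring the setup quoted above (names here are illustrative)."""

    def __init__(self, model_name: str = "gpt2"):  # model name is a placeholder
        super().__init__()
        self.body = AutoModel.from_pretrained(model_name)  # carries pretrained knowledge
        self.head = nn.Linear(self.body.config.hidden_size, 1)  # randomly initialized

    def forward(self, input_ids, attention_mask):
        hidden = self.body(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Pool by taking the hidden state of the last non-padding token
        # (one common convention; the paper does not pin this down in the quote).
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.head(pooled).squeeze(-1)  # one scalar reward per sequence


def pairwise_loss(r_preferred, r_rejected):
    """Standard Bradley-Terry comparison loss: push the reward of the
    human-preferred summary above the reward of the rejected one."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```

The point of the sketch is that only `self.head` starts from random weights; everything in `self.body` is inherited, which is what makes it plausible that the reward model retains the predictive model's knowledge.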