But I’m not really aware of any compelling alternatives to this class of plan: “training a model based on a reward signal” is basically all of machine learning, so if you wanted an alignment strategy that’s competitive, I don’t see what else you could do.
There is an alternative. Rather than applying the reward signal to the model’s own outputs, apply it to the pretraining corpus, or to samples generated by humans or some weaker model. This removes the possibility of a very capable model using very capable persuasion techniques to game the training process. (The problem is that this is heavily off-policy: the reward feedback is less concentrated in the region the policy actually occupies, so you likely need more of it.)
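To make the distinction concrete, here is a minimal toy sketch (my own illustration, not from the comment above; the `reward` function, the corpus, and the two-option policy are all hypothetical). The on-policy setup scores samples drawn from the current policy, so the policy influences what gets scored; the off-policy setup attaches reward to a fixed corpus whose contents the policy can never affect.

```python
import random

def reward(text):
    # Hypothetical reward signal: prefers polite text.
    return 1.0 if "please" in text else 0.0

# Off-policy: reward is attached to a fixed corpus (human-written or
# from a weaker model) up front; the trained policy cannot steer it.
corpus = ["please pass the salt", "give me that", "please wait here"]
offpolicy_batch = [(x, reward(x)) for x in corpus]

# On-policy (RLHF-style): reward is attached to samples drawn from the
# current policy, so a capable policy shapes what the reward sees.
def policy_sample(rng):
    return rng.choice(["please comply", "do what I say"])

rng = random.Random(0)
onpolicy_batch = [(s, reward(s)) for s in (policy_sample(rng) for _ in range(3))]

# Reward-weighted fine-tuning keeps (or upweights) high-reward examples;
# in the off-policy case these weights never depend on the policy.
def kept_examples(batch):
    return [x for x, r in batch if r > 0.0]

print(kept_examples(offpolicy_batch))
```

The cost the comment notes shows up here too: most of the fixed corpus may lie far from where the policy operates, so a larger share of the off-policy reward labels is spent on irrelevant examples.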