Sure, but the fact that a “fix” would even be necessary highlights that RLHF is too brittle relative to slightly OOD thought experiments, in the sense that RLHF misgeneralizes from the human preference data it was actually given during training. This could be a case of misalignment either between the human preference data and the reward model, or between the reward model and the language model. (Unlike SFT, RLHF involves a separate reward model as a “middle man”, because reinforcement learning is too sample-inefficient to work directly with the limited amount of human preference data.)
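For concreteness, here is a minimal sketch of that “middle man” step, assuming the standard setup: the reward model is fit to pairwise preference data with a Bradley-Terry style loss, and its scalar output is what the RL stage then optimizes against. This is written in PyTorch; the class and variable names are illustrative, not from any particular RLHF codebase.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a pooled (prompt, response) representation to a scalar reward."""
    def __init__(self, hidden_dim: int = 16):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_response: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_response).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximize the log-probability that the
    # human-preferred response receives the higher scalar reward.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random vectors stand in for encoded (prompt, response) pairs.
model = RewardModel()
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = preference_loss(model(chosen), model(rejected))
loss.backward()  # the trained reward model then supplies the reward signal for the RL step (e.g. PPO)
```

The point of the intermediary is visible here: a few thousand pairwise comparisons can train the reward model, which then scores arbitrarily many rollouts during RL, so any gap between what humans preferred and what the reward model learned gets amplified at that second stage.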