Boltzmann factor to upweight answers that your overseer likes. AFAICT this doesn’t generally induce more causal Goodhart problems than best-of-N selection does.
This seems correct insofar as your proxy reward does not have huge upward errors (that you don't remove via some sort of clipping). For example, suppose there are 1 million normal sentences with reward uniformly distributed on [0, 100] and one adversarial sentence with reward r = 10^5. Conditioning on reward > 99 gives only about a 1/10,000 chance of sampling the adversarial sentence, while it's very tricky (if not impossible) to set the KL penalty so that you actually optimize reward without just outputting the adversarial sentence over and over again.
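Here's a quick numerical sketch of that example (the uniform base policy and the particular beta are just illustrative choices):

```python
import numpy as np

# Toy version of the example: 1M normal sentences with reward ~ U[0, 100],
# plus one adversarial sentence with reward 1e5. Assumes a uniform base
# policy over sentences; beta is an illustrative KL-penalty temperature.
rng = np.random.default_rng(0)
rewards = np.append(rng.uniform(0, 100, 1_000_000), 1e5)

# Conditioning (BoN-style): sample uniformly among sentences with reward > 99.
p_adv_conditioning = 1.0 / (rewards > 99).sum()

# KL-penalized RL against a uniform base policy has the Boltzmann form
# p(x) proportional to exp(r(x) / beta); the single huge upward error soaks
# up nearly all of the probability mass unless beta is made absurdly large.
beta = 100.0
weights = np.exp((rewards - rewards.max()) / beta)
p_adv_boltzmann = weights[-1] / weights.sum()

print(f"P(adversarial | reward > 99)   ~ {p_adv_conditioning:.1e}")  # ~1e-4
print(f"P(adversarial) under Boltzmann ~ {p_adv_boltzmann:.4f}")     # ~1.0
```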
I don’t feel totally satisfied by this argument though, because RL with KL penalty generally seems kind of unprincipled to me. I’d
I don’t think it’s particularly unprincipled; the KL penalty is just our way of encoding that the base policy is relatively reasonable (in a value-free way), and the model shouldn’t deviate from it too much. Similarly, BoN is another way of encoding that the base policy is reasonable, albeit one that’s easier to reason about on average.
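For what it's worth, the standard closed form makes this explicit: the optimum of the KL-penalized objective is just a Boltzmann reweighting of the base policy,

$$\pi^\star = \operatorname*{arg\,max}_{\pi}\Big(\mathbb{E}_{x\sim\pi}[r(x)] - \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_0)\Big), \qquad \pi^\star(x) \propto \pi_0(x)\, \exp\!\big(r(x)/\beta\big),$$

so both the KL penalty and BoN only ever redistribute probability mass that $\pi_0$ already assigns.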
Another way people regularize their RL is to mix in self-supervised training with RL (for example, they did this for text-davinci-003). I’m pretty sure this is also equivalent to RL w/ KL penalty.
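Here's a minimal sketch of that kind of mixing, assuming a single softmax policy over a toy vocabulary (the reward vector, base distribution, and gamma_mix are all made-up stand-ins):

```python
import numpy as np

# Gradient ascent on  E_p[reward] + gamma_mix * E_base[log p].  The
# self-supervised term is, up to a constant, a forward-KL penalty pulling p
# toward the base distribution -- the same spirit as an explicit KL
# regularizer, though not literally identical to penalizing KL(p || p0).
rng = np.random.default_rng(0)
V = 10
logits = np.zeros(V)
reward = rng.normal(size=V)            # stand-in proxy reward per token
base_dist = rng.dirichlet(np.ones(V))  # stand-in base / pretraining distribution

gamma_mix, lr = 0.5, 0.1
for _ in range(500):
    p = np.exp(logits - logits.max()); p /= p.sum()
    grad_rl = p * (reward - p @ reward)   # d/dlogits of E_p[reward]
    grad_ssl = base_dist - p              # d/dlogits of E_base[log p]
    logits += lr * (grad_rl + gamma_mix * grad_ssl)

print("final policy:", np.round(p, 3))
```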
I am unsure whether other reasonable models of RL will similarly not have the causal Goodhart problem.
There's the impact-penalty approach (e.g. AUP or RLSP), which seems more principled than KL penalties in cases where you have more information about the space of rewards.
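As a rough sketch of what an AUP-style penalty looks like (`q_aux`, `noop_action`, and `lam` are illustrative names, and real AUP learns the auxiliary Q-values rather than being handed them):

```python
import numpy as np

def aup_shaped_reward(task_reward, q_aux, action, noop_action, lam=0.1):
    """Rough sketch of an AUP-style impact penalty.

    q_aux has shape (num_aux_rewards, num_actions): Q-values of the current
    state under a set of auxiliary reward functions. The penalty is the
    average change in attainable auxiliary value relative to doing nothing
    (the no-op action).
    """
    penalty = np.mean(np.abs(q_aux[:, action] - q_aux[:, noop_action]))
    return task_reward - lam * penalty

# Illustrative usage with made-up numbers.
q_aux = np.array([[1.0, 0.2, 0.9],
                  [0.5, 0.5, 0.4]])
print(aup_shaped_reward(task_reward=1.0, q_aux=q_aux, action=1, noop_action=0))
```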
You can also do constrained optimization, which is again equivalent to modifying the reward function (via Lagrange multipliers), but is way easier for humans to reason about and give feedback on.
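To spell out the equivalence: a Lagrangian relaxation turns the constraint into a (dynamically weighted) modification of the reward. A minimal sketch with a discrete "policy" and made-up reward/cost vectors:

```python
import numpy as np

# Constrained policy optimization via Lagrangian relaxation:
#   max_p E_p[reward]  s.t.  E_p[cost] <= eps
# is handled by alternating (1) a policy step on the *modified* reward
# (reward - lam * cost) and (2) a dual-ascent step on the multiplier lam.
rng = np.random.default_rng(0)
n = 100
reward = rng.uniform(0, 1, n)
cost = rng.uniform(0, 1, n)   # e.g. something the overseer wants bounded
eps = 0.3                     # constraint threshold
logits = np.zeros(n)
lam, lr, lr_dual = 0.0, 0.5, 0.5

for _ in range(500):
    p = np.exp(logits - logits.max()); p /= p.sum()
    modified = reward - lam * cost                    # the modified reward
    logits += lr * p * (modified - p @ modified)      # softmax policy gradient
    lam = max(0.0, lam + lr_dual * (p @ cost - eps))  # dual ascent on lam

print(f"E[reward] = {p @ reward:.3f}, E[cost] = {p @ cost:.3f} (target <= {eps})")
```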