I’m surprised no one has brought up the quantilizer results, specifically the quantilizer optimality theorem from Taylor 2015:
Theorem 1 (Quantilizer optimality). Choose q=1/t. Then, a q-quantilizer maximizes expected U-utility subject to constraint 2.
where constraint 2 is that you don’t do more than t times worse in expectation on any possible cost function, relative to the original distribution of actions. That is, quantilizers (which are in turn approximated by BoN) are the optimal solution to a particular robust RL problem.
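For concreteness, here’s a minimal sketch of a q-quantilizer over a finite action set, plus the BoN approximation mentioned above. The uniform base distribution, the toy action grid, and the proxy utility are illustrative assumptions of mine, not details from Taylor 2015:

```python
import numpy as np

def quantilize(actions, utilities, q, rng=None):
    """Sample one action uniformly from the top q-fraction of `actions`,
    ranked by the proxy utility U.

    Assumes a uniform base distribution over a finite action set; this is an
    illustrative simplification (Taylor 2015 allows any base distribution)."""
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort(utilities)[::-1]           # action indices, best U first
    k = max(1, int(np.ceil(q * len(actions))))    # size of the top q-quantile
    return actions[rng.choice(order[:k])]

def best_of_n(sample_action, utility, n, rng=None):
    """Best-of-n (BoN): draw n i.i.d. actions from the base policy and keep
    the one with the highest proxy utility; approximates a q-quantilizer
    with q ~ 1/n."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = [sample_action(rng) for _ in range(n)]
    return max(candidates, key=utility)

# Toy usage with a made-up 1-D action grid and proxy utility.
acts = np.linspace(-1.0, 1.0, 1000)
u = -(acts - 0.3) ** 2
a_quant = quantilize(acts, u, q=0.05)
a_bon = best_of_n(lambda r: r.choice(acts), lambda a: -(a - 0.3) ** 2, n=20)
```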
However, it turns out that KL-regularized RL is also optimal with respect to a robust optimization problem, for reward functions “close enough” to the proxy reward! Here’s a proof for the Max Ent RL case, which transfers to the KL-regularized RL case if you change the reference measure to be the base policy π0.
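To spell out why the transfer works: MaxEnt RL is just KL-regularized RL against a uniform reference policy, so swapping the uniform measure for the base policy π0 turns a MaxEnt robustness result into a KL-regularized one. A sketch of the identity, in my own notation (temperature β):

```latex
% MaxEnt RL: expected proxy reward plus an entropy bonus
J_{\mathrm{MaxEnt}}(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_t r(s_t,a_t)\Big] + \beta\,\mathcal{H}(\pi)

% KL-regularized RL: the same reward, penalized by divergence from the base policy \pi_0
J_{\mathrm{KL}}(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_t r(s_t,a_t)\Big] - \beta\,D_{\mathrm{KL}}(\pi \,\|\, \pi_0)

% These differ only by a constant when \pi_0 is uniform over actions:
D_{\mathrm{KL}}(\pi \,\|\, \mathrm{Unif}(\mathcal{A})) = \log|\mathcal{A}| - \mathcal{H}(\pi)
```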
(In general, you just want to solve the robust RL problem wrt a prior over rewards that has a grain of truth and is also as narrow as possible, which is just adversarial AUP.)
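One hedged way to write down that robust problem (my notation, only a sketch): let P be a prior over reward functions whose support contains the true reward (the “grain of truth”), kept as narrow as possible, and optimize against the worst reward in that support while keeping the KL tether to π0:

```latex
\pi^{*} \;=\; \arg\max_{\pi} \; \min_{R \,\in\, \mathrm{supp}(P)} \;
\mathbb{E}_{\pi}\Big[\textstyle\sum_t R(s_t,a_t)\Big] \;-\; \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_0)
```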