I’m surprised no one has brought up the quantilizer results, specifically the quantilizer optimality theorem from Taylor 2015:
Theorem 1 (Quantilizer optimality). Choose q=1/t. Then, a q-quantilizer maximizes expected U-utility subject to constraint 2.
where constraint 2 is that you don’t do more than t times worse in expectation on any possible cost function, relative to the original distribution of actions. That is, quantilizers (which are in turn approximated by BoN) are the optimal solution to a particular robust RL problem.
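For concreteness, here’s a minimal sketch of a q-quantilizer over a finite action set, plus the BoN approximation mentioned above. The uniform base distribution, the toy action grid, and the proxy utility are illustrative assumptions of mine, not details from Taylor 2015:

```python
import numpy as np

def quantilize(actions, utilities, q, rng=None):
    """Sample one action uniformly from the top q-fraction of `actions`,
    ranked by the proxy utility U.

    Assumes a uniform base distribution over a finite action set; this is an
    illustrative simplification (Taylor 2015 allows any base distribution)."""
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort(utilities)[::-1]           # action indices, best U first
    k = max(1, int(np.ceil(q * len(actions))))    # size of the top q-quantile
    return actions[rng.choice(order[:k])]

def best_of_n(sample_action, utility, n, rng=None):
    """Best-of-n (BoN): draw n i.i.d. actions from the base policy and keep
    the one with the highest proxy utility; approximates a q-quantilizer
    with q ~ 1/n."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = [sample_action(rng) for _ in range(n)]
    return max(candidates, key=utility)

# Toy usage with a made-up 1-D action grid and proxy utility.
acts = np.linspace(-1.0, 1.0, 1000)
u = -(acts - 0.3) ** 2
a_quant = quantilize(acts, u, q=0.05)
a_bon = best_of_n(lambda r: r.choice(acts), lambda a: -(a - 0.3) ** 2, n=20)
```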
However, it turns out that KL-regularized RL is also optimal with respect to a robust optimization problem, for reward functions “close enough” to the proxy reward! Here’s a proof for the Max Ent RL case, which transfers to the KL-regularized RL case if you change the reference measure to be the base policy π0.
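To spell out why the transfer works: MaxEnt RL is just KL-regularized RL against a uniform reference policy, so swapping the uniform measure for the base policy π0 turns a MaxEnt robustness result into a KL-regularized one. A sketch of the identity, in my own notation (temperature β):

```latex
% MaxEnt RL: expected proxy reward plus an entropy bonus
J_{\mathrm{MaxEnt}}(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_t r(s_t,a_t)\Big] + \beta\,\mathcal{H}(\pi)

% KL-regularized RL: the same reward, penalized by divergence from the base policy \pi_0
J_{\mathrm{KL}}(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_t r(s_t,a_t)\Big] - \beta\,D_{\mathrm{KL}}(\pi \,\|\, \pi_0)

% These differ only by a constant when \pi_0 is uniform over actions:
D_{\mathrm{KL}}(\pi \,\|\, \mathrm{Unif}(\mathcal{A})) = \log|\mathcal{A}| - \mathcal{H}(\pi)
```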
(In general, you just want to solve the robust RL problem wrt a prior over rewards that has a grain of truth and is also as narrow as possible, which is just adversarial AUP.)
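One hedged way to write down that robust problem (my notation, only a sketch): let P be a prior over reward functions whose support contains the true reward (the “grain of truth”), kept as narrow as possible, and optimize against the worst reward in that support while keeping the KL tether to π0:

```latex
\pi^{*} \;=\; \arg\max_{\pi} \; \min_{R \,\in\, \mathrm{supp}(P)} \;
\mathbb{E}_{\pi}\Big[\textstyle\sum_t R(s_t,a_t)\Big] \;-\; \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_0)
```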