So here’s a thing that I think John is pointing at, with a bit more math:
The divergence is in the distance function.
- In the paper, we define the distance between rewards as the angle between reward vectors.
- So what we sort of do is look at the “dot product”, i.e., the expected product of the true and proxy rewards, with states/actions sampled according to a uniform distribution. I give justification as to why this is a natural way to define distance in a separate comment.
But the issue here is that this isn’t the distribution of the actions/states we might see in practice. That dot product might be very high if states/actions are instead weighted by drawing them from a distribution induced by a certain policy (e.g., the policy of “killing lots of snakes without doing anything sneaky to game the reward” in the examples, I think?). But then as people optimize, the policy changes and this number goes down. A uniform distribution is actually likely quite far from any state/action distribution we would see in practice.
In other words, the way we formally define reward distance here will often not match how “close” two reward functions seem, and lots of cases of “Goodharting” are cases where two reward functions just seem close on a particular state/action distribution but aren’t close according to our distance metric.
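To make the gap concrete, here is a minimal toy sketch. The numbers, the four state/action pairs, and the helper `weighted_angle` are all made up for illustration, and it skips the projection step the paper uses and just compares raw reward vectors under a weighted inner product (my assumption for what "weighting by a distribution" means here).

```python
import math

def weighted_angle(r1, r2, weights):
    """Angle in radians between two reward vectors under a weighted inner product."""
    dot = sum(w * a * b for w, a, b in zip(weights, r1, r2))
    norm1 = math.sqrt(sum(w * a * a for w, a in zip(weights, r1)))
    norm2 = math.sqrt(sum(w * b * b for w, b in zip(weights, r2)))
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))  # clamp rounding error
    return math.acos(cosine)

# Hypothetical rewards over four state/action pairs.
# The proxy agrees with the true reward on the first two pairs only.
true_reward = [1.0, 1.0, -1.0, -1.0]
proxy_reward = [1.0, 1.0, 1.0, 1.0]

# Uniform weighting versus a made-up "on-policy" weighting that
# almost never visits the pairs where the rewards disagree.
uniform = [0.25, 0.25, 0.25, 0.25]
on_policy = [0.49, 0.49, 0.01, 0.01]

uniform_angle = weighted_angle(true_reward, proxy_reward, uniform)      # pi/2: orthogonal
on_policy_angle = weighted_angle(true_reward, proxy_reward, on_policy)  # about 0.28 rad
```

Under the uniform weighting the two rewards are orthogonal, while under the on-policy weighting they look nearly aligned. Once optimization shifts the policy toward the last two pairs, the apparent closeness goes away.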
This makes the results of the paper primarily useful for working towards training regimes where we optimize the proxy and can approximate distance, which is described in Appendix F of the paper. This is because as we optimize the proxy it will start to generalize, and then problems with over-optimization as described in the paper are going to start mattering a lot more.
An important part of the paper that I think is easily missed, and useful for people doing work on distances between reward vectors:
There is some existing literature on defining distances between reward functions (e.g., see Gleave et al.). However, all proposed distances are only pseudometrics.
A bit about distance functions:
Commonly, two reward functions are defined to be the same (e.g., see Skalse et al.) if they’re equivalent up to scaling the reward function and introducing potential shaping. By the latter, I mean that two reward functions are the same if one is R and the other is of the form R + γ·Φ(next state) − Φ(current state) for some function Φ and discount γ. This is because Ng et al. show that these make up all reward vectors that we know give the same optimal policy as the original reward across all environments (with the same state/action space).
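As a quick sanity check of the shaping claim, here is a small sketch on a made-up 3-state, 2-action deterministic MDP. The transitions, rewards, and potential Φ are all hypothetical; it only checks that value iteration returns the same greedy policy for R and for the shaped reward in this one environment, which is an illustration, not a proof.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2

# Hypothetical deterministic transitions: NEXT_STATE[s][a] is the successor state.
NEXT_STATE = [[1, 2], [2, 0], [0, 1]]
# Hypothetical reward R(s, a) and potential function Phi(s).
REWARD = [[0.0, 1.0], [0.5, -1.0], [2.0, 0.0]]
PHI = [3.0, -1.0, 0.5]

def shape(reward, phi):
    """Return R + gamma * Phi(next state) - Phi(current state)."""
    return [[reward[s][a] + GAMMA * phi[NEXT_STATE[s][a]] - phi[s]
             for a in range(N_ACTIONS)] for s in range(N_STATES)]

def greedy_policy(reward, iterations=500):
    """Run value iteration, then return the greedy action in each state."""
    values = [0.0] * N_STATES
    for _ in range(iterations):
        values = [max(reward[s][a] + GAMMA * values[NEXT_STATE[s][a]]
                      for a in range(N_ACTIONS)) for s in range(N_STATES)]
    return [max(range(N_ACTIONS),
                key=lambda a: reward[s][a] + GAMMA * values[NEXT_STATE[s][a]])
            for s in range(N_STATES)]

original_policy = greedy_policy(REWARD)
shaped_policy = greedy_policy(shape(REWARD, PHI))
```

The shaped Q-values differ from the original ones by −Φ(s), which is constant across actions in a given state, so the argmax in each state is unchanged.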
This leads us to the following important claim:
Projecting reward vectors onto Ω and taking the angle between them is a perfect distance metric according to these desiderata.
Why: It can easily be shown it’s a metric, provided it’s well-defined with respect to the equivalence relation. It can also be shown that the locus of reward functions that give the same projection as R onto Ω is exactly the set of potential-shaped reward functions. Then the claim pretty clearly follows.
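Here is a rough numerical sketch of the projection claim on a made-up MDP. It rests on an assumption that is not spelled out in this comment: that "projecting onto Ω" means projecting onto the linear subspace parallel to the occupancy-measure polytope, i.e., the null space of the flow-constraint matrix below. The transitions, rewards, potential, and all helper names are hypothetical.

```python
import math

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2
NEXT_STATE = [[1, 2], [2, 0], [0, 1]]  # hypothetical deterministic transitions
PAIRS = [(s, a) for s in range(N_STATES) for a in range(N_ACTIONS)]

# Assumed flow-constraint matrix: row s2, column (s, a) holds
# 1[s == s2] - gamma * 1[next state of (s, a) == s2].
FLOW = [[(1.0 if s == s2 else 0.0) - (GAMMA if NEXT_STATE[s][a] == s2 else 0.0)
         for (s, a) in PAIRS] for s2 in range(N_STATES)]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def project(reward):
    """Subtract the component of the reward that lies in the row space of FLOW."""
    gram = [[sum(x * y for x, y in zip(ri, rj)) for rj in FLOW] for ri in FLOW]
    coeffs = solve(gram, [sum(x * y for x, y in zip(row, reward)) for row in FLOW])
    return [reward[k] - sum(coeffs[i] * FLOW[i][k] for i in range(N_STATES))
            for k in range(len(reward))]

def angle(u, v):
    """Ordinary Euclidean angle in radians between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

reward = [0.0, 1.0, 0.5, -1.0, 2.0, 0.0]  # hypothetical R(s, a), flattened over PAIRS
phi = [3.0, -1.0, 0.5]                    # hypothetical potential
shaped = [reward[k] + GAMMA * phi[NEXT_STATE[s][a]] - phi[s]
          for k, (s, a) in enumerate(PAIRS)]

raw_angle = angle(reward, shaped)
projection_gap = max(abs(x - y) for x, y in zip(project(reward), project(shaped)))
```

In this toy setup the raw vectors sit at a sizeable angle, while the projected vectors coincide up to floating-point error, so the projected angle is zero. The reason is that the shaping term equals minus a linear combination of the rows of `FLOW` with coefficients Φ, so the projection removes it entirely.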
In particular, this seems like the most natural “true” reward metric, and I’m not sure any other “true” metrics have even been proposed before this.