johnswentworth comments on Goodhart’s Law in Reinforcement Learning

johnswentworth 16 Oct 2023 21:03 UTC
LW: 37 AF: 16
Very cool math (and clear post), but I think this formulation basically fails to capture the central Goodhart problem.
Relevant slogan: Goodhart is about generalization, not approximation.
A simple argument for that slogan: if we have a “true” utility $U(X)$, and an approximation $U'(X)$ which is always within $ϵ$ of $U$, then optimizing $U'$ achieves a utility under $U$ within $2ϵ$ of the optimal $U$. So optimizing an approximation of the utility function yields an outcome which is approximately-optimal under the true utility function… if the approximation holds well everywhere.
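Spelling out the $2ϵ$ step as a short sketch (the names $x^{*}$ and $x'$ are notation introduced here, and the sketch assumes both maxima exist): let $x^{*}$ maximize $U$ and $x'$ maximize $U'$. Then

$$U(x') \ge U'(x') - ϵ \ge U'(x^{*}) - ϵ \ge U(x^{*}) - 2ϵ$$

The first and last inequalities use $|U(x) - U'(x)| \le ϵ$ for every $x$, and the middle one uses the fact that $x'$ maximizes $U'$. Note that the bound needs the $ϵ$-closeness at both $x'$ and $x^{*}$, which is why it only helps if the approximation holds everywhere.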
In all the standard real-world examples of Goodhart, the real problem is that the proxy is not even approximately correct once we move out of a certain regime. For instance, consider the classic case in which the British offered a reward for dead snakes (hoping to kill them off), and people responded by farming snakes. That reward did not even approximately reflect what the British wanted, once snake-farms entered the picture. Or the Soviet nail factories, rewarded for number of nails produced, which produced huge numbers of tiny useless nails; once in that regime, the reward just completely failed to approximate what the central bureau wanted. These are failures of generalization, not approximation.
So my concern with the OP is that it relies on an approximation assumption:
if we have a bound on the angle $θ$ between the true and the proxy, then we can detect when there is at least one policy within a cone of $θ$ around $R_{1}$ whose return goes down
… and that’s not really how Goodhart works. Goodhart isn’t about cases where the proxy approximates the “true” goal, it’s about cases where the approximation just completely breaks down in some regimes.
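To illustrate the approximation-versus-generalization distinction, here is a minimal toy sketch. Everything in it is an assumption made up for illustration (the quadratic `true_u`, the two proxies, the cutoff at `x = 4`, and the helper `regret_under_true`); none of it comes from the paper. It compares a proxy that is within `eps` of the true utility everywhere against a proxy that is exactly right in one regime and unrelated outside it.

```python
import numpy as np


def regret_under_true(true_u, proxy_u):
    """True utility lost by picking the option that maximizes the proxy."""
    chosen = int(np.argmax(proxy_u))
    return float(np.max(true_u) - true_u[chosen])


# Hypothetical one-dimensional option space (assumption for illustration).
x = np.linspace(0.0, 10.0, 1001)
true_u = -(x - 3.0) ** 2  # toy "true" utility, best option at x = 3

# Case 1: proxy is within eps of the true utility at every option.
eps = 0.5
rng = np.random.default_rng(0)
proxy_close = true_u + rng.uniform(-eps, eps, size=x.shape)

# Case 2: proxy equals the true utility for x <= 4 (the familiar regime),
# but keeps rewarding larger x outside that regime.
proxy_regime = np.where(x <= 4.0, true_u, x)

regret_close = regret_under_true(true_u, proxy_close)
regret_regime = regret_under_true(true_u, proxy_regime)

print(f"regret, proxy within eps everywhere: {regret_close:.3f}")
print(f"regret, proxy correct only in regime: {regret_regime:.3f}")
```

In the first case the regret cannot exceed `2 * eps`, however the noise falls. In the second case the proxy has zero error on every option in the familiar regime, yet the optimizer leaves that regime and ends up at `x = 10`, far from the true optimum.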
- jacob_cannell 16 Oct 2023 21:50 UTC
  10 points
  
  Goodhart isn’t about cases where the proxy approximates the “true” goal, it’s about cases where the approximation just completely breaks down in some regimes.
  
  I agree with your general sentiment, but how does that translate formally to the math framework of the paper? In other words, where does their formulation diverge from reality?
  
  Perhaps it’s in how they define the occupancy function over explicit (state, action) pairs. It seems like the occupancy measure doesn’t actually weight by state probability correctly, which seems odd. If so, you could have two reward functions that seem arbitrarily aligned (dot product close to 1), but only because they agree on the vast volume of highly improbable states, and not on the tiny manifold of likely states.
  
  Moreover, in reality the state space is essentially infinite, and everything must operate in a highly compressed model space for generalization regardless. So even if the ‘true’ unknown utility function can be defined over the (potentially infinite) state space, any practical reward proxy cannot: it is a function of some limited low-dimensional encoding of the state space, such that the mapping from that encoding to the full state space is highly nonlinear and complex. We can’t realistically expand that function to the true full state space, nor expect the linearity to translate into the compressed model space. Tiny changes in the model space can translate to arbitrary jumps in the full state space, updates to the model compression function (ontology shifts) can shift everything around, etc.
  - OliverHayman 16 Oct 2023 22:36 UTC
    10 points
    So here’s a thing that I think John is pointing at, with a bit more math?:
    
    The divergence is in the distance function.
    
    - In the paper, we define the distance between rewards as the angle between reward vectors.
    - So what we sort of do is look at the “dot product”, i.e., look at $E [R_{1} (S, A) \cdot R_{2} (S, A)]$ for true and proxy rewards $R_{1}$ and $R_{2}$, with states/actions sampled according to a uniform distribution. I give justification as to why this is a natural way to define distance in a separate comment.
    But the issue here is that this isn’t the distribution of the actions/states we might see in practice. $E [R_{1} (S, A) \cdot R_{2} (S, A)]$ might be very high if states/actions are instead weighted by drawing them from a distribution induced by a certain policy (e.g., the policy of “killing lots of snakes without doing anything sneaky to game the reward” in the examples, I think?). But then as people optimize, the policy changes and this number goes down. A uniform distribution is actually likely quite far from any state/action distribution we would see in practice.
    
    In other words the way we formally define reward distance here will often not match how “close” two reward functions seem, and lots of cases of “Goodharting” are cases where two reward functions just seem close on a particular state/action distribution but aren’t close according to our distance metric.
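
    One hedged way to write the mismatch down (the symbols $D$, $θ_{D}$ and $d^{π}$ are notation introduced here for illustration, and this ignores any normalization or projection steps the paper applies before measuring the angle): for a distribution $D$ over state/action pairs, define

    $$\cos θ_{D} (R_{1}, R_{2}) = \frac{E_{(S, A) \sim D} [R_{1} (S, A) \cdot R_{2} (S, A)]}{\sqrt{E_{(S, A) \sim D} [R_{1} (S, A)^{2}] \cdot E_{(S, A) \sim D} [R_{2} (S, A)^{2}]}}$$

    With $D$ uniform over a finite set of state/action pairs, this is the cosine of the ordinary angle between the two reward vectors. With $D = d^{π}$, the state/action distribution induced by some policy $π$, the same expression can be close to 1 even when the uniform version is not. In this picture, the Goodhart cases above are ones where $\cos θ_{d^{π}}$ is near 1 for the initial policy but drops as optimization moves $π$ into a different regime.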
    
    This makes the results of the paper primarily useful for working towards training regimes where we optimize the proxy and can approximate distance, which is described in Appendix F of the paper. This is because as we optimize the proxy it will start to generalize, and then problems with over-optimization as described in the paper are going to start mattering a lot more.
    - OliverHayman 16 Oct 2023 22:54 UTC
      4 points
      So more concretely, this is work towards some sort of RLHF training regime that “provably” avoids Goodharting. The main issue is that a lot of the numbers we’re using are quite hard to approximate.
    - jacob_cannell 16 Oct 2023 23:18 UTC
      3 points
      Thanks, that clarifies somewhat, but I guess I’ll need to read the paper. I’m still a bit confused about the justification for a uniform distribution.
      
      with states/actions sampled according to a uniform distribution. I give justification as to why this is a very natural way to define distance in a separate comment.
      
      A uniform distribution actually seems like a very weird choice here.
      
      Defining utility functions over full world states seems fine (even if not practical at larger scale), and defining alignment as dot products over full trajectory/state space utility functions also seems fine, but only if using true expected utility (i.e., the actual Bayesian posterior distribution over states). That of course can get arbitrarily complex.
      
      But it also seems necessary: for one to say that two utility functions are truly ‘close’, it seems that must cash out to closeness of (perhaps normalized) expected utilities given the true distribution of future trajectories.
      
      Do you see a relation between the early stopping criteria and regularization/generalization of the proxy reward?
      - OliverHayman 16 Oct 2023 23:31 UTC
        6 points
        The reason we’re using a uniform distribution is that it follows naturally from the math, but maybe an intuitive explanation is the following: the reason this seems weird is that most realistic distributions are only going to sample from a small number of states/actions, whereas the uniform distribution more or less encodes that the reward functions are similar across most states/actions. So it’s encoding something about generalization.