i don’t think this is unique to world models. you can also think of rewards as things you move towards or away from. this is compatible with translation/scaling-invariance because if you move towards everything but move towards X even more, then in the long run you will do more of X on net, because you only have so much probability mass to go around.
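to make the “finite probability mass” point concrete, here’s a minimal numerical sketch, assuming a softmax policy over three actions (the policy form and the numbers are just illustrative, not anything from the discussion itself):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# three actions; action 0 plays the role of "X"
logits = np.array([1.0, 1.0, 1.0])
print(softmax(logits))  # uniform: every action gets 1/3

# "move towards everything, but towards X even more":
# every logit gets a positive update, X's update is just bigger
logits = logits + np.array([1.0, 0.5, 0.5])
print(softmax(logits))  # X's probability goes up, the others' go down
```

because the probabilities have to sum to one, a uniform push towards every action cancels out, and only the relative sizes of the updates matter.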
i have an alternative hypothesis for why positive and negative motivation feel distinct in humans.
although translating the reward doesn’t change the expectation of the reward gradient, it hugely affects the variance of the gradient.[1] in other words, if you always move towards everything, you will still eventually learn the right thing, but it will take a lot longer.
my hypothesis is that humans have some hard-coded baseline for variance reduction. in the ancestral environment, the expectation of perceived reward was centered around where zero feels to be. our minds do try to adjust to changes in distribution (e.g. hedonic adaptation), but it’s not perfect, and so in the current world, our baseline may be suboptimal.
[1] Quick proof sketch (this is a very standard result in RL and is the motivation for advantage estimation, but it’s still good practice to check things).
The REINFORCE estimator is $\nabla_\theta R = \mathbb{E}_{\tau\sim\pi(\cdot)}\left[R(\tau)\,\nabla_\theta\log\pi(\tau)\right]$.
WLOG, suppose we define a new reward $R'(\tau) = R(\tau) + k$ where $k > 0$ (and assume that $\mathbb{E}[R] = 0$, so $R'$ is moving away from the mean).
Then we can verify the expectation of the gradient is still the same:
$$\nabla_\theta R' - \nabla_\theta R = \mathbb{E}_{\tau\sim\pi(\cdot)}\left[k\,\nabla_\theta\log\pi(\tau)\right] = k\int \pi(\tau)\,\frac{\nabla_\theta\pi(\tau)}{\pi(\tau)}\,d\tau = k\,\nabla_\theta\int \pi(\tau)\,d\tau = 0.$$
But the variance increases:
$$\mathbb{V}_{\tau\sim\pi(\cdot)}\left[R(\tau)\,\nabla_\theta\log\pi(\tau)\right] = \int R(\tau)^2\left(\nabla_\theta\log\pi(\tau)\right)^2\pi(\tau)\,d\tau - \left(\nabla_\theta R\right)^2$$
$$\mathbb{V}_{\tau\sim\pi(\cdot)}\left[R'(\tau)\,\nabla_\theta\log\pi(\tau)\right] = \int \left(R(\tau)+k\right)^2\left(\nabla_\theta\log\pi(\tau)\right)^2\pi(\tau)\,d\tau - \left(\nabla_\theta R\right)^2$$
So:
$$\mathbb{V}_{\tau\sim\pi(\cdot)}\left[R'(\tau)\,\nabla_\theta\log\pi(\tau)\right] - \mathbb{V}_{\tau\sim\pi(\cdot)}\left[R(\tau)\,\nabla_\theta\log\pi(\tau)\right] = 2k\int R(\tau)\left(\nabla_\theta\log\pi(\tau)\right)^2\pi(\tau)\,d\tau + k^2\int \left(\nabla_\theta\log\pi(\tau)\right)^2\pi(\tau)\,d\tau$$
The second term on the right is always non-negative and scales as $k^2$ (the first is a cross term that vanishes whenever $R$ is uncorrelated with $(\nabla_\theta\log\pi)^2$, since $\mathbb{E}[R]=0$). More generally, if $\mathbb{E}[R]=k$, the variance grows as $O(k^2)$. So having your rewards be uncentered hurts a ton.
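As a quick numerical sanity check (a toy three-armed bandit with a uniform softmax policy; the arm rewards and noise scale are made-up numbers, not anything from above), the per-sample REINFORCE gradients keep the same mean as $k$ grows, up to Monte Carlo error, while their variance blows up roughly like $k^2$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# one-step bandit: 3 actions, mean rewards roughly centered on zero
true_rewards = np.array([0.5, 0.0, -0.5])
p = softmax(np.zeros(3))  # uniform policy

def reinforce_grads(shift, n=200_000):
    """Per-sample REINFORCE gradient estimates w.r.t. the logits,
    with every reward shifted by the constant `shift`."""
    a = rng.choice(3, size=n, p=p)
    r = true_rewards[a] + rng.normal(0.0, 0.1, size=n) + shift
    # for a softmax policy, grad_logits log pi(a) = one_hot(a) - p
    return (np.eye(3)[a] - p) * r[:, None]

for k in [0.0, 1.0, 10.0]:
    g = reinforce_grads(k)
    print(f"k={k:5.1f}  mean={np.round(g.mean(axis=0), 3)}  var={np.round(g.var(axis=0), 2)}")
```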
> if you move towards everything but move towards X even more, then in the long run you will do more of X on net, because you only have so much probability mass to go around

I have a mental category of “results that are almost entirely irrelevant for realistically-computationally-bounded agents” (e.g. results related to AIXI), and my gut sense is that this seems like one such result.
I mean, this situation is grounded & formal enough that you can just go and implement the relevant RL algorithm and see if it’s relevant for that computationally bounded agent, right?
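A toy version of that experiment might look something like the sketch below: vanilla REINFORCE with no baseline on a three-armed softmax bandit, with every reward shifted by a constant $k$ (the arm rewards, learning rate, batch size, and step count are arbitrary toy choices). In a small tabular setting like this you’d typically expect larger shifts to make learning noisier and slower rather than to change what is eventually learned.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train(shift, steps=2000, lr=0.1, batch=16, seed=0):
    """Vanilla REINFORCE (no baseline) on a 3-armed softmax bandit,
    with every reward shifted by `shift`. Returns the probability the
    policy puts on the best arm, averaged over training -- a rough
    proxy for how quickly it learns."""
    rng = np.random.default_rng(seed)
    true_rewards = np.array([0.5, 0.0, -0.5])  # arm 0 is best
    logits = np.zeros(3)
    history = []
    for _ in range(steps):
        p = softmax(logits)
        history.append(p[0])
        a = rng.choice(3, size=batch, p=p)
        r = true_rewards[a] + rng.normal(0.0, 0.1, size=batch) + shift
        grad = ((np.eye(3)[a] - p) * r[:, None]).mean(axis=0)  # REINFORCE estimate
        logits += lr * grad
    return float(np.mean(history))

for k in [0.0, 1.0, 5.0]:
    score = np.mean([train(k, seed=s) for s in range(5)])  # average a few seeds
    print(f"shift k={k}: avg P(best arm) during training ~ {score:.3f}")
```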
is this for a reason other than the variance thing I mention?
I think the thing I mention is still important, because it means there is no fundamental difference between positive and negative motivation. I agree that if everything was different degrees of extreme bliss then the variance would be so high that you never learn anything in practice. but if you shift everything slightly such that some mildly unpleasant things are now mildly pleasant, I claim this will make learning a bit faster or slower but still converge to the same thing.
Suppose you’re in a setting where the world is so large that you will only ever experience a tiny fraction of it directly, and you have to figure out the rest via generalization. Then your argument doesn’t hold up: shifting the mean might totally break your learning. But I claim that the real world is like this. So I am inherently skeptical of any result (like most convergence results) that relies on just trying approximately everything and gradually learning which to prefer and disprefer.
are you saying something like: you can’t actually do more of everything except one thing, because you’ll never do everything. so there’s a lot of variance that comes from exploration that multiplies with your $O(k^2)$ variance from having a suboptimal zero point. so in practice your k needs to be very close to optimal. so my thing is true but not useful in practice.
i feel people do empirically shift k quite a lot throughout life and it does seem to change how effectively they learn. if you’re mildly depressed your k is slightly too low and you learn a little bit slower. if you’re mildly manic your k is too high and you also learn a little bit slower. therapy, medications, and meditations shift k mildly.