When you think of goals as reward/utility functions, the distinction between positive and negative motivations (e.g. as laid out in this sequence) isn’t very meaningful, since it all depends on how you normalize them.
But when you think of goals as world-models (as in predictive processing/active inference), the distinction is very sharp: your world-model-goals can be either of things you should move towards, or of things you should move away from.
This updates me towards thinking that the positive/negative motivation distinction is more meaningful than I thought.
In run-and-tumble motion, “things are going well” implies “keep going”, whereas “things are going badly” implies “choose a new direction at random”. Very different! And I suggest in §1.3 here that there’s an unbroken line of descent from the run-and-tumble signal in our worm-like common ancestor with C. elegans, to the “valence” signal that makes things seem good or bad in our human minds. (Suggestively, both run-and-tumble in C. elegans, and the human valence, are dopamine signals!)
So if some idea pops into your head, “maybe I’ll stand up”, and it seems appealing, then you immediately stand up (the human “run”); if it seems unappealing on net, then that thought goes away and you start thinking about something else instead, semi-randomly (the human “tumble”).
So positive and negative are deeply different. Of course, we should still call this an RL algorithm. It’s just that it’s an RL algorithm that involves a (possibly time- and situation-dependent) heuristic estimator of the expected value of a new random plan (a.k.a. the expected reward if you randomly tumble). If you’re way above that expected value, then keep doing whatever you’re doing; if you’re way below the threshold, re-roll for a new random plan.
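For concreteness, here is a minimal sketch of that thresholded keep-going-vs-re-roll dynamic. The environment, valence function, and threshold value below are invented for illustration; nothing here is meant as a claim about the actual neural implementation.

```python
import random

def run_and_tumble(valence, propose_plan, threshold, steps=100):
    """Toy version of the 'keep going vs. re-roll' dynamic described above.

    valence(plan) -> float: how good things currently seem under this plan.
    propose_plan() -> a new semi-random plan (the 'tumble').
    threshold: heuristic estimate of the expected valence of a random new plan.
    """
    plan = propose_plan()
    history = []
    for _ in range(steps):
        v = valence(plan)
        history.append((plan, v))
        if v < threshold:          # "things are going badly": tumble
            plan = propose_plan()  # re-roll a new semi-random plan
        # else: "things are going well": keep the current plan (run)
    return history

# Hypothetical usage: plans are headings in degrees, valence peaks at heading 90.
if __name__ == "__main__":
    random.seed(0)
    hist = run_and_tumble(
        valence=lambda heading: -abs(heading - 90) / 90,
        propose_plan=lambda: random.uniform(0, 360),
        threshold=-1.0,  # rough estimate of the expected valence of a random heading
        steps=20,
    )
    print(hist[-1])
```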
As one example of how this ancient basic distinction feeds into more everyday practical asymmetries between positive and negative motivations, see my discussion of motivated reasoning here, including in §3.3.3 the fact that “it generally feels easy and natural to brainstorm / figure out how something might happen, when you want it to happen. Conversely, it generally feels hard and unnatural to figure out how something might happen, when you want it to not happen.”
This reminds me of a conversation I had recently about whether the concept of “evil” is useful. I was arguing that I found “evil”/“corruption” helpful as a handle for a more model-free “move away from this kind of thing even if you can’t predict how exactly it would be bad” relationship to a thing, which I found hard to express in more consequentialist frames.
I feel like “evil” and “corruption” mean something different.
Corruption is about selfish people exchanging their power within a system for favors (often outside the system) when they’re not supposed to according to the rules of the system. For example, a policeman taking bribes. It’s something the creators/owners of the system should try to eliminate, but if the system itself is bad (e.g. Nazi Germany during the Holocaust), corruption might be something you sometimes ought to seek out rather than avoid, like Schindler saving his Jews.
“Evil” I’ve in the past tended to take to refer to a sort of generic expression of badness (like you might call a sadistic sexual murderer evil, and you might call Hitler evil, and you might call plantation owners evil, even though these have little to do with each other), but that was partly due to me naively believing that everyone is “trying to be good” in some sense. Like if I had to define evil, I would have defined it as “doing bad stuff for badness’s sake, the inversion of good”, though of course nobody actually is like that, so it’s only really used hyperbolically or for fictional characters as hyperstimuli.
But after learning more about morality, there seem to be multiple things that can be called “evil”:
Antinormativity (which admittedly is pretty adjacent to corruption, like if people are trying to stop corruption, then the corruption can use antinormativity to survive)
Coolness, i.e. countersignalling against goodness-hyperstimuli wielded by authorities, i.e. demonstrating an ability and desire to break the rules
People who hate great people and cherry-pick unfortunate side-effects of the great people’s activities, to make good people think that the great people are conspiring against them and must be fought
Leaders who commit to stopping the above by selecting for people who do bad stuff to prove their loyalty to those leaders (think e.g. the Trump administration)
I think “evil” is used in the generic sense often enough that it doesn’t make sense to insist that any of the above is the strictly correct meaning. However, if the aim is just to describe someone who might unpredictably do something bad, I think I’d use words like “dangerous” or “creepy”, and if it’s just to describe someone who carries memes that would unpredictably do something bad, I’d use a word like “brainworms” (rather than “evil”).
i don’t think this is unique to world models. you can also think of rewards as things you move towards or away from. this is compatible with translation/scaling-invariance because if you move towards everything but move towards X even more, then in the long run you will do more of X on net, because you only have so much probability mass to go around.
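One toy way to see this point, assuming the policy normalizes its preferences with a softmax (an assumption of this illustration, not something the comment specifies): adding the same constant to every option's value leaves the resulting action distribution unchanged, so only the relative boost to X matters.

```python
import numpy as np

# A softmax policy over three options only cares about relative values, so
# "moving towards everything, but towards X even more" ends up identical to
# "moving towards X alone".
values         = np.array([0.0, 0.0, 1.0])   # only X (index 2) is boosted
shifted_values = values + 5.0                # move towards *everything*, X most

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

print(softmax(values))          # e.g. [0.21, 0.21, 0.58]
print(softmax(shifted_values))  # identical: the uniform shift cancels out
```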
i have an alternative hypothesis for why positive and negative motivation feel distinct in humans.
although the expectation of the reward gradient doesn’t change if you translate the reward, the translation hugely affects the variance of the gradient.[1] in other words, if you always move towards everything, you will still eventually learn the right thing, but it will take a lot longer.
my hypothesis is that humans have some hard-coded baseline for variance reduction. in the ancestral environment, the expectation of perceived reward was centered around where zero feels to be. our minds do try to adjust to changes in distribution (e.g. hedonic adaptation), but it’s not perfect, and so in the current world, our baseline may be suboptimal.
Quick proof sketch (this is a very standard result in RL and is the motivation for advantage estimation, but it's still good practice to check things).
The REINFORCE estimator is $\nabla_\theta \bar{R} = \mathbb{E}_{\tau\sim\pi(\cdot)}\left[R(\tau)\,\nabla_\theta \log \pi(\tau)\right]$.
WLOG, suppose we define a new reward $R'(\tau) = R(\tau) + k$ where $k > 0$ (and assume that $\mathbb{E}[R] = 0$, so $R'$ is moving away from the mean).
Then we can verify the expectation of the gradient is still the same:
$\nabla_\theta \bar{R}' - \nabla_\theta \bar{R} = k\,\mathbb{E}_{\tau\sim\pi(\cdot)}\left[\nabla_\theta \log \pi(\tau)\right] = k\int \pi(\tau)\,\frac{\nabla_\theta \pi(\tau)}{\pi(\tau)}\,d\tau = k\,\nabla_\theta \int \pi(\tau)\,d\tau = 0.$
But the variance increases:
$\mathrm{Var}_{\tau\sim\pi(\cdot)}\left[R(\tau)\,\nabla_\theta \log \pi(\tau)\right] = \int R(\tau)^2\,\big(\nabla_\theta \log \pi(\tau)\big)^2\,\pi(\tau)\,d\tau - \big(\nabla_\theta \bar{R}\big)^2$
$\mathrm{Var}_{\tau\sim\pi(\cdot)}\left[R'(\tau)\,\nabla_\theta \log \pi(\tau)\right] = \int \big(R(\tau)+k\big)^2\,\big(\nabla_\theta \log \pi(\tau)\big)^2\,\pi(\tau)\,d\tau - \big(\nabla_\theta \bar{R}\big)^2$
So:
$\mathrm{Var}_{\tau\sim\pi(\cdot)}\left[R'(\tau)\,\nabla_\theta \log \pi(\tau)\right] - \mathrm{Var}_{\tau\sim\pi(\cdot)}\left[R(\tau)\,\nabla_\theta \log \pi(\tau)\right] = 2k\int R(\tau)\,\big(\nabla_\theta \log \pi(\tau)\big)^2\,\pi(\tau)\,d\tau + k^2\int \big(\nabla_\theta \log \pi(\tau)\big)^2\,\pi(\tau)\,d\tau$
The second term on the right is always non-negative and scales as k², so it dominates for large k. More generally, if E[R]=k, the variance grows as O(k²). So having your rewards be uncentered hurts a ton.
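A quick numerical check of this, using an invented two-armed-bandit setup (the sigmoid policy parameterization and the ±1 rewards are illustrative assumptions, not anything from the comment): shifting every reward by k leaves the mean of the single-sample REINFORCE estimate unchanged but inflates its variance roughly like k².

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3
p1 = 1.0 / (1.0 + np.exp(-theta))          # probability of arm 1 under a sigmoid policy

def reinforce_samples(k, n=200_000):
    """Single-sample REINFORCE gradient estimates for a 2-armed bandit,
    with every reward shifted by the constant k."""
    a = rng.random(n) < p1                  # sample actions from the policy
    reward = np.where(a, 1.0, -1.0) + k     # shifted rewards
    grad_logp = np.where(a, 1.0 - p1, -p1)  # d/dtheta log pi(a)
    return reward * grad_logp

for k in [0.0, 5.0, 50.0]:
    g = reinforce_samples(k)
    print(f"k={k:5.1f}  mean={g.mean():+.4f}  variance={g.var():10.2f}")
# The mean (the true gradient) stays the same; the variance grows roughly like k^2.
```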
“if you move towards everything but move towards X even more, then in the long run you will do more of X on net, because you only have so much probability mass to go around”
I have a mental category of “results that are almost entirely irrelevant for realistically-computationally-bounded agents” (e.g. results related to AIXI), and my gut sense is that this seems like one such result.
I mean, this situation is grounded & formal enough that you can just go and implement the relevant RL algorithm and see if it’s relevant for that computationally bounded agent, right?
is this for a reason other than the variance thing I mention?
I think the thing I mention is still important because it means there is no fundamental difference between positive and negative motivation. I agree that if everything was different degrees of extreme bliss then the variance would be so high that you never learn anything in practice. but if you shift everything slightly such that some mildly unpleasant things are now mildly pleasant, I claim this will make learning a bit faster or slower but still converge to the same thing.
Suppose you’re in a setting where the world is so large that you will only ever experience a tiny fraction of it directly, and you have to figure out the rest via generalization. Then your argument doesn’t hold up: shifting the mean might totally break your learning. But I claim that the real world is like this. So I am inherently skeptical of any result (like most convergence results) that relies on just trying approximately everything and gradually learning which to prefer and disprefer.
are you saying something like: you can’t actually do more of everything except one thing, because you’ll never do everything. so there’s a lot of variance that comes from exploration that multiplies with your O(k²) variance from having a suboptimal zero point. so in practice your k needs to be very close to optimal. so my thing is true but not useful in practice.
i feel people do empirically shift k quite a lot throughout life and it does seem to change how effectively they learn. if you’re mildly depressed your k is slightly too low and you learn a little bit slower. if you’re mildly manic your k is too high and you also learn a little bit slower. therapy, medications, and meditations shift k mildly.
In (non-monotonic) infra-Bayesian physicalism, there is a vaguely similar asymmetry even though it’s formalized via a loss function. Roughly speaking, the loss function expresses preferences over “which computations are running”. This means that you can have a “positive” preference for a particular computation to run or a “negative” preference for a particular computation not to run[1].
There are also more complicated possibilities, such as “if P runs then I want Q to run, but if P doesn’t run then I’d rather that Q also doesn’t run”, or even preferences that are only expressible in terms of entanglement between computations.
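As a purely illustrative toy (this is not the infra-Bayesian physicalism formalism, just a sketch of the preference shapes described above): a loss over boolean "is this computation running" flags can encode a positive preference, a negative preference, and the conditional "Q should track P" preference. All names and weights below are made up.

```python
def toy_loss(runs):
    """Toy loss over which computations are running (lower is better).

    Encodes: a positive preference for "A" to run, a negative preference
    for "B" to run, and a conditional preference that "Q" runs iff "P" runs.
    """
    loss = 0.0
    loss -= 1.0 if "A" in runs else 0.0   # want A to run
    loss += 1.0 if "B" in runs else 0.0   # want B *not* to run
    p, q = "P" in runs, "Q" in runs
    loss += 0.0 if p == q else 0.5        # want Q to track P
    return loss

print(toy_loss({"A", "P", "Q"}))  # -1.0: everything as preferred
print(toy_loss({"B", "Q"}))       #  1.5: B runs, and Q runs without P
```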
Reminds me of @MalcolmOcean’s post on how awayness can’t aim (except maybe in 1D worlds) since it can only move away from things, and aiming at a target requires going toward something.
Imagine trying to steer someone to stop in one exact spot. You can place a ❤ beacon they’ll move towards, or an X beacon they’ll move away from. (Reverse for pirates I guess.)
In a hallway, you can kinda trap them in the middle of two Xs, or just put the ❤ in the exact spot.
In an open field, you can maybe trap them in the middle of a bunch of XXXXs, but that’ll be hard because if you try to make a circle of X, and they’re starting outside it, they’ll probably just avoid it. If you get to move around, you can maybe kinda herd them to the right spot then close in, but it’s a lot of work.
Or, you can just put the ❤ in the exact spot.
For three dimensions, consider a helicopter or bird or some situation where there’s a height dimension as well. Now the X-based orientation is even harder because they can fly up to get away from the Xs, but with the ❤ you still just need one beacon for them to home in on.
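A small numerical sketch of the same point, using a made-up quadratic potential for each beacon (the potentials, step size, and positions are my illustrative assumptions): following a "toward" gradient converges on the exact target, while following an "away" gradient never settles anywhere.

```python
import numpy as np

def descend(grad, x0, lr=0.1, steps=200):
    """Follow -grad for a fixed number of steps; returns the final position."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= lr * grad(x)
    return x

target = np.array([3.0, 4.0])

# "Toward" beacon at the target: gradient of 0.5*||x - target||^2 pulls the
# agent straight onto the target, so one beacon suffices to pick the exact spot.
toward_grad = lambda x: x - target

# "Away" beacon at the origin: gradient of -0.5*||x||^2 pushes the agent
# radially outward forever, so it never stops anywhere, let alone at the target.
away_grad = lambda x: -x

print(descend(toward_grad, x0=[0.0, 0.0]))  # ~[3, 4]: lands on the beacon
print(descend(away_grad, x0=[1.0, 1.0]))    # coordinates blow up: no resting point
```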
In Richard Jeffrey’s utility theory there is actually a very natural distinction between positive and negative motivations/desires. A plausible axiom is U(⊤)=0 (the tautology has zero desirability: you already know it’s true). Which implies with the main axiom[1] that the negation of any proposition with positive utility has negative utility, and vice versa. Which is intuitive: If something is good, its negation is bad, and the other way round. In particular, if U(X)=U(¬X) (indifference between X and ¬X), then U(X)=U(¬X)=0.
More generally, U(¬X)=−(P(X)/P(¬X))U(X). Which means that the positive and negative utility of a proposition and its negation are scaled according to their relative odds. For example, while your lottery ticket winning the jackpot is obviously very good (large positive utility), having a losing ticket is clearly not very bad (small negative utility). Why? Because losing the lottery is very likely, far more likely than winning. Which means losing was already “priced in” to a large degree. If you learned that you indeed lost, that wouldn’t be a big update, so the “news value” is negative but not large in magnitude.
Which means this utility theory has a zero point. Utility functions are therefore not invariant under adding an arbitrary constant. So the theory actually allows you to say X is “twice as good” as Y, “three times as bad”, “much better” etc. It’s a ratio scale.
If P(X∧Y)=0 and P(X∨Y)≠0, then U(X∨Y) = (P(X)U(X) + P(Y)U(Y)) / (P(X) + P(Y)).
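A quick numerical illustration of these relations, with made-up numbers for the lottery example (the probability and utility values below are purely illustrative):

```python
# Jeffrey-utility relations from above: U(not-X) = -(P(X)/P(not-X)) * U(X),
# and the axiom applied to the disjunction of two disjoint outcomes.
p_win = 1e-7          # probability of winning the jackpot (illustrative)
u_win = 1_000_000.0   # desirability of the news "I won" (illustrative)

u_lose = -(p_win / (1 - p_win)) * u_win
print(u_lose)         # ~ -0.1: losing is barely bad news, since it was priced in

# Disjunction of the two disjoint outcomes is the tautology, and nets out to zero:
u_tautology = (p_win * u_win + (1 - p_win) * u_lose) / (p_win + (1 - p_win))
print(u_tautology)    # ~ 0, i.e. U(⊤) = 0
```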