Steven Byrnes comments on What Is The Alignment Problem?

Steven Byrnes 16 Jan 2025 18:50 UTC
6 points
2
“Is scratching your nose right now something you desire?” Yes. “Is scratching your nose right now something you value?” Not really, no. But I claim that the Value Reinforcement Learning framework would assign a positive score to the idea of scratching my nose when it’s itchy. Otherwise, nobody would scratch their nose.
I desire peace and justice, but I also value peace and justice, so that’s not a good way to distinguish them.
(I suspect that you took my definition of “values” to be nearly synonymous with rewards or immediately anticipated rewards, which it very much is not; projected-upstream-generators-of-rewards are a quite different beast from rewards themselves, especially as we push further upstream.)
No, that’s not what I think. I think your definition points to whether things are motivating versus demotivating all-things-considered, including both immediate plans and long-term plans. And I want to call that desires. Desires can be long-term—e.g. “being a dad someday is something I very much desire”.
I think “values”, as people use the term in everyday life, tends to be something more specific, where not only is the thing motivating, but it’s also motivating when you think about it in a self-reflective way. A.k.a. “X is motivating” AND “the-idea-of-myself-doing-X is motivating”. If I’m struggling to get out of bed, because I’m going to be late for work, then the feeling of my head remaining on the pillow is motivating, but the self-reflective idea of myself being in bed is demotivating. Consequently, I might describe the soft feeling of the pillow on my head as something I desire, but not something I value.
(I talk about this in §8.4.2–8.5 here but that might be pretty hard to follow out of context.)
- johnswentworth 16 Jan 2025 19:11 UTC
  5 points
  0
  Parent
  “Is scratching your nose right now something you desire?” Yes. “Is scratching your nose right now something you value?” Not really, no.
  I think disagree with this example. And I definitely disagree with it in the relevant-to-the-broader-context sense that, under a value-aligned sovereign AI, if my nose is itchy then it should get scratched, all else equal. It may not be a very important value, but it’s a value. (More generally, satisfying desires is itself a value, all else equal.)
  I think “values”, as people use the term in everyday life, tends to be something more specific, where not only is the thing motivating, but it’s also motivating when you think about it in a self-reflective way. A.k.a. “X is motivating” AND “the-idea-of-myself-doing-X is motivating”. If I’m struggling to get out of bed, because I’m going to be late for work, then the feeling of my head remaining on the pillow is motivating, but the self-reflective idea of myself being in bed is demotivating. Consequently, I might describe the soft feeling of the pillow on my head as something I desire, but not something I value.
  I do agree that that sort of reasoning is a pretty big central part of values. And it’s especially central for cases where I’m trying to distinguish my “real” values from cases where my reward stream is “inaccurately” sending signals for some reason, like e.g. heroin.
  Here’s how I’d imagine that sort of thing showing up organically in a Value-RL-style system.
  My brain is projecting all this value-structure out into the world; it’s modelling reward signals as being downstream of “values” which are magically attached to physical stuff/phenomena/etc. Insofar as X has high value, the projection-machinery expects both that X itself will produce a reward, and that the-idea-of-myself-doing-X will produce a reward (… as well as various things of similar flavor, like e.g. “my friends will look favorably on me doing X, thereby generating a reward”, “the idea of other people doing X will produce a reward”, etc). If those things come apart, then the brain goes “hmm, something fishy is going on here, I’m not sure these rewards are generated by my True Values; maybe my reward stream is compromised somehow, or maybe my social scene is reinforcing things which aren’t actually good for me, or...”.
  That’s the sort of reasoning which should naturally show up in a Value RL style system capable of nontrivial model structure learning.
  - Steven Byrnes 16 Jan 2025 21:49 UTC
    9 points
    0
    Parent
    under a value-aligned sovereign AI, if my nose is itchy then it should get scratched, all else equal
    Well, if the AI can make my nose not itch in the first place, I’m OK with that too. Whereas I wouldn’t make an analogous claim about things that I “value”, by my definition of “value”. If I really want to have children, I’m not OK with the AI removing my desire to have children, as a way to “solve” that “problem”. That’s more of a “value” and not just a desire.
    That’s the sort of reasoning which should naturally show up in a Value RL style system capable of nontrivial model structure learning.
    I’m not sure what point your making here. If human brains run on Value RL style systems (which I think I agree with), and humans in fact do that kind of reasoning, then tautologically, that kind of reasoning is a thing that can show up in Value RL style systems.
    Still, there’s a problem that it’s possible for some course-of-action to seem appealing when I think about it one way, and unappealing when I think about it a different way. Ego-dystonic desires like addictions are one example of that, but it also comes up in tons of normal situations like deciding what to eat. It’s a problem in the sense that it’s unclear what a “value-aligned” AI is supposed to be doing in that situation.
    - johnswentworth 16 Jan 2025 21:57 UTC
      5 points
      0
      Parent
      Cool, I think we agree here more than I thought based on the comment at top of chain. I think the discussion in our other thread is now better aimed than this one.
- xpym 17 Jan 2025 10:21 UTC
  3 points
  2
  Parent
  
  the feeling of my head remaining on the pillow is motivating, but the self-reflective idea of myself being in bed is demotivating
  
  This seems to be an example of conflicting values, and its preferred resolution, not a difference between a value and a non-value. Suppose you’d find your pillow replaced by a wooden log—I’d imagine that the self-reflective idea of yourself remedying this state of affairs would be pretty motivating!
  - Steven Byrnes 17 Jan 2025 13:15 UTC
    4 points
    2
    Parent
    I claim that if you find someone who’s struggling to get out of bed, making groaning noises, and ask them the following question:
    Hey, I have a question about your values. The thing you’re doing right now, staying in bed past your alarm, in order to be more comfortable at the expense of probably missing your train and having to walk to work in the cold rain … is this thing you’re doing in accordance with your values?
    I bet the person says “no”. Yet, they’re still in fact doing that thing, which implies (tautologically) that they have some desire to do it—I mean, they’re not doing it “by accident”! So it’s conflicting desires, not conflicting values.
    I don’t think your wooden log example is relevant. Insofar as different values are conflicting, that conflict has already long ago been resolved, and the resolution is: the action which best accords with the person’s values, in this instance, is to get up. And yet, they’re still horizontal.
    Another example: if someone says “I want to act in accordance with my values” or “I don’t always act in accordance with my values”, we recognize these as two substantive claims. The first is not a tautology, and the second is not a self-contradiction.
    - xpym 17 Jan 2025 14:47 UTC
      1 point
      0
      Parent
      
      I bet the person says “no”.
      
      I agree, but I think it’s important to mention issues like social desirability bias and strategic self-deception here, coupled with the fact that most people just aren’t particularly good at introspection.
      
      it’s conflicting desires, not conflicting values
      
      It’s both, our minds employ desires in service of pursuing our (often conflicting) values.
      
      Insofar as different values are conflicting, that conflict has already long ago been resolved, and the resolution is: the action which best accords with the person’s values, in this instance, is to get up.
      
      I’d rather put it as a routine conflict eventually getting resolved in a predictable way.
      
      Another example: if someone says “I want to act in accordance with my values” or “I don’t always act in accordance with my values”, we recognize these as two substantive claims. The first is not a tautology, and the second is not a self-contradiction.
      
      Indeed, but I claim that those statements actually mean “I want my value conflicts to resolve in the way I endorse” and “I don’t always endorse the way my value conflicts resolve”.