“Is scratching your nose right now something you desire?” Yes. “Is scratching your nose right now something you value?” Not really, no.
I think I disagree with this example. And I definitely disagree with it in the relevant-to-the-broader-context sense that, under a value-aligned sovereign AI, if my nose is itchy then it should get scratched, all else equal. It may not be a very important value, but it’s a value. (More generally, satisfying desires is itself a value, all else equal.)
I think “values”, as people use the term in everyday life, tends to be something more specific, where not only is the thing motivating, but it’s also motivating when you think about it in a self-reflective way. A.k.a. “X is motivating” AND “the-idea-of-myself-doing-X is motivating”. If I’m struggling to get out of bed, because I’m going to be late for work, then the feeling of my head remaining on the pillow is motivating, but the self-reflective idea of myself being in bed is demotivating. Consequently, I might describe the soft feeling of the pillow on my head as something I desire, but not something I value.
I do agree that that sort of reasoning is a pretty big central part of values. And it’s especially central for cases where I’m trying to distinguish my “real” values from cases where my reward stream is “inaccurately” sending signals for some reason, like e.g. heroin.
Here’s how I’d imagine that sort of thing showing up organically in a Value-RL-style system.
My brain is projecting all this value-structure out into the world; it’s modelling reward signals as being downstream of “values” which are magically attached to physical stuff/phenomena/etc. Insofar as X has high value, the projection-machinery expects both that X itself will produce a reward, and that the-idea-of-myself-doing-X will produce a reward (… as well as various things of similar flavor, like e.g. “my friends will look favorably on me doing X, thereby generating a reward”, “the idea of other people doing X will produce a reward”, etc). If those things come apart, then the brain goes “hmm, something fishy is going on here, I’m not sure these rewards are generated by my True Values; maybe my reward stream is compromised somehow, or maybe my social scene is reinforcing things which aren’t actually good for me, or...”.
That’s the sort of reasoning which should naturally show up in a Value RL style system capable of nontrivial model structure learning.
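To make that concrete, here’s a minimal toy sketch (all the names, numbers, and thresholds are made up for illustration; this isn’t anyone’s actual proposal) of a learner that posits one latent “value” per world-feature and treats rewards under several framings of that feature as downstream of it. When the framings disagree badly, it flags the feature as fishy rather than absorbing the reward as a true value.

```python
# Toy sketch: a value-RL-ish learner that posits a single latent "value"
# per world-feature X and expects that value to explain rewards under
# several framings of X (doing X, imagining myself doing X, imagining
# others doing X, ...). Large disagreement between framings is flagged
# as "fishy" -- possible reward-stream corruption rather than a genuine
# value. Everything here is illustrative, not a real architecture.

from collections import defaultdict

class ProjectedValueModel:
    def __init__(self, lr=0.1, fishiness_threshold=1.0):
        self.lr = lr
        self.fishiness_threshold = fishiness_threshold
        # Per-feature: one shared latent value, plus per-framing reward estimates.
        self.latent_value = defaultdict(float)
        self.framing_estimate = defaultdict(lambda: defaultdict(float))

    def observe(self, feature, framing, reward):
        """Update the framing-specific estimate and the shared latent value."""
        est = self.framing_estimate[feature]
        est[framing] += self.lr * (reward - est[framing])
        # The latent value is pulled toward the average across framings,
        # since the model treats all framings as downstream of one value.
        avg = sum(est.values()) / len(est)
        self.latent_value[feature] += self.lr * (avg - self.latent_value[feature])

    def fishiness(self, feature):
        """Spread between framings: high spread means the 'one value explains
        all framings' story is breaking down for this feature."""
        est = self.framing_estimate[feature]
        if len(est) < 2:
            return 0.0
        return max(est.values()) - min(est.values())

    def looks_like_true_value(self, feature):
        return self.fishiness(feature) < self.fishiness_threshold

model = ProjectedValueModel()
# Heroin-ish case: "doing X" keeps paying off, but the self-reflective framing doesn't.
for _ in range(20):
    model.observe("heroin", "doing_X", reward=5.0)
    model.observe("heroin", "idea_of_myself_doing_X", reward=-2.0)
print(model.fishiness("heroin"), model.looks_like_true_value("heroin"))
```

The point of the sketch is just that the “hmm, something fishy” check falls out automatically once the model posits a single latent cause behind all the framings: no separate “detect corrupted rewards” machinery is needed, only a consistency check on the projection.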
under a value-aligned sovereign AI, if my nose is itchy then it should get scratched, all else equal
Well, if the AI can make my nose not itch in the first place, I’m OK with that too. Whereas I wouldn’t make an analogous claim about things that I “value”, by my definition of “value”. If I really want to have children, I’m not OK with the AI removing my desire to have children, as a way to “solve” that “problem”. That’s more of a “value” and not just a desire.
That’s the sort of reasoning which should naturally show up in a Value RL style system capable of nontrivial model structure learning.
I’m not sure what point you’re making here. If human brains run on Value RL style systems (which I think I agree with), and humans in fact do that kind of reasoning, then tautologically, that kind of reasoning is a thing that can show up in Value RL style systems.
Still, there’s a problem that it’s possible for some course-of-action to seem appealing when I think about it one way, and unappealing when I think about it a different way. Ego-dystonic desires like addictions are one example of that, but it also comes up in tons of normal situations like deciding what to eat. It’s a problem in the sense that it’s unclear what a “value-aligned” AI is supposed to be doing in that situation.
Cool, I think we agree here more than I thought based on the comment at top of chain. I think the discussion in our other thread is now better aimed than this one.